Dealing With the Uncertainty of Big Data Security
With the amount of data being produced, we need to figure out how secure is secure enough
7/9/2014 12:12:24 AM
By Reg Harbeck
Red Green had it right: Three words we find hard to say are, “I don’t know.”
On the mainframe, we’ve taken a long, careful journey of learning about properly securing sensitive data to make it available as much as required, but not more. Every piece of data has an owner and/or steward, every access is explicitly granted and the implications of an unauthorized exposure are generally understood.
But what do you do when securing big data from many disparate sources, combined into a single massive pool, with more possible security implications than you can predict or pre-empt?
At SHARE in Pittsburgh this August, Brian Cummings and I will be presenting on this question, and we look forward to your input.
What are the potential sources of the big data we deal with, both on and off the mainframe? Of course, there’s established structured database data—generally accessible via the relational model, whether or not stored that way. And multiple sources of such structured data may be joined together, either formally using means such as SQL commands, or generically by being poured into a common pool of data.
You’re probably thinking: No way is that important data finding its way into the pool with generic data from other sources. And ideally you have a management-supported strategy for preventing that from happening. But with so much corporate data being served up from centralized systems such as mainframes, through multiple web-front-end applications, and the results of those interactions often being mashed up through additional applications, it requires a very high degree of awareness and control to be certain your most important and confidential data isn’t somehow making its way into a larger pool of multi-sourced big data.
Additionally prone to be poured into such a pool are data from unstructured sources, such as blogs, reports, wikis, application screen scraping and web search results—both external from the Internet and from more confidential document sources in an organization’s intranet.
Then there are non-character data such as audio, picture and video, not to mention the output of technologies capable of drawing meaning from them such as speech, character and facial recognition.
In and of themselves, each of these sources has varying degrees and clarity of confidentiality, sensitivity and validity. But once they begin to be grouped together, the hard edges begin to disappear and the only safe thing seems to be to refuse anyone any access to the results. But that’s not the world we live in. Instead, such conglomerations of big data are rapidly becoming the norm, sometimes with much less oversight and control than any security professional could willingly accept.
The consequences of inappropriate exposure of such data are unpredictably severe, complicated further by legal and regulatory implications. But the opportunities to be gained from managing and using it effectively are often irresistible.
If you’re still reading at this point and haven’t resigned your computing security-related job to become a hermit, then you’re clearly hoping something can be done about this. Fortunately, as in much of our experience of computing, there are lessons to be learned from other parts of life, philosophy and science.
So, let’s take a look at some of this transferrable wisdom as we seek to get on top of this matter. Specifically, we’ll look for insights from:
- Heisenberg’s Uncertainty Principle
- Gödel’s Incompleteness Theorem
- The Halting Problem
- Schrödinger’s Cat
Because this article is primarily about securing big data, not these other fields of study, I’ll first briefly summarize the relevant aspects of each of these, and then review how we can map what we learn to how we approach securing big data.
Heisenberg’s Uncertainty Principle
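What this principle of quantum physics essentially tells us is that certain pairs of properties, such as a particle’s position and momentum, can’t both be measured to arbitrary precision: there is a hard limit on how much detail you can pin down at once.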
Likewise, with vast pools of big data from many sources, it can become impossible to perfectly tease apart all the security implications of every piece of information.
The first answer to this is to know the provenance and security relevance of each data source before combining, and continue to respect that as the pool fills. However, the moment it begins to be used in combination with other data in the pool and to generate new conclusions, keeping a tight security grip on the results can be like tracking the water from a glass that has been dumped into a swimming pool. It can become prohibitively expensive at best, impossible at worst.
Relinquishing responsibility for this can seem like an attractive path, as provable ignorance of circumstances and implications is often used as a defense—but normally only after something has gone terribly wrong. And as we’ll see shortly, it’s not possible to predict everything that might go wrong.
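To make the "know the provenance before combining" idea a little more concrete, here is a minimal sketch in Python. It assumes a simple four-level sensitivity scale and a rule that any derived result inherits every contributing source and the highest sensitivity among them. The names used here (Sensitivity, Labeled, label, combine, can_read, and the sample sources) are hypothetical illustrations, not part of any particular product or standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    # Assumed four-level scale; a real organization would define its own.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Labeled:
    value: object
    sources: frozenset          # every source that contributed to this value
    sensitivity: Sensitivity    # highest sensitivity among those sources


def label(value, source, sensitivity):
    """Tag a value with its single originating source before it enters the pool."""
    return Labeled(value, frozenset([source]), sensitivity)


def combine(a, b, merge):
    """Derive a new value; it inherits all sources and the highest sensitivity."""
    return Labeled(
        merge(a.value, b.value),
        a.sources | b.sources,
        max(a.sensitivity, b.sensitivity),
    )


def can_read(clearance, item):
    """Allow access only when the reader's clearance covers the item's label."""
    return clearance >= item.sensitivity


blog_post = label("Outage reported in region 7", "public_blog", Sensitivity.PUBLIC)
account = label("Customer 1234, region 7", "mainframe_db", Sensitivity.CONFIDENTIAL)
mashup = combine(blog_post, account, lambda x, y: f"{x} | {y}")
```

In this sketch the mashup carries both source names and is treated as confidential, even though half of it came from a public blog. Note the limitation: a "take the highest label" rule is only a conservative floor. It cannot detect the case where two individually harmless items become sensitive once combined, which is exactly the uncertainty this section describes.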
Gödel’s Incompleteness Theorem
Essentially, what this result from mathematical logic tells us is that any consistent formal system expressive enough to describe basic arithmetic contains statements that can be neither proved nor disproved within that system. A loose everyday analogy is the sentence, “this sentence is false,” which is true if and only if it is false.
When dealing with big data and the insights that emerge from it, the sensitivity of a given piece of information can also be genuinely undecidable, particularly when combined with the cost of securing it. Where the data is unstructured, it may even contain claims of its own confidentiality which directly contradict what a careful assessment might conclude.
The Halting Problem
In theoretical computer science, a very simple construct called a Turing machine (named after Alan Turing, a father of modern computer science) is used to model computation in its simplest possible form. But it was shown that even at this level of simplicity, no general procedure can decide in advance whether an arbitrary Turing machine will ever finish running. In general, the best you can do is run it and watch, and if it hasn’t stopped yet, you still can’t be sure it never will.
Likewise, with big data, there are no guaranteed heuristics for securing every piece of information properly. Regardless of how carefully and exactly you try to secure it, there is a permanent likelihood that some new security exposure will eventually emerge, and you’ll have to figure out how to respond to it after the fact.
Of course, that’s life. In fact, that’s how our immune systems operate: reactively as well as proactively. And the journey has no definable or discoverable end. You have to keep going.
Schrödinger’s Cat
This favorite illustration of quantum physics tells us that, when you can’t measure something exactly, such as whether a given cat inside a given closed box is alive or not, you have to calculate by probabilities instead.
So, with big data, you may not be able to examine every last record, document and combined result, but you can assign probabilities of security and then fine-tune over time, using experience to assign better likelihood models.
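One way to picture "assign probabilities and then fine-tune over time" is a simple running estimate that is updated after each audit sample. The sketch below assumes a Beta-Bernoulli model, which is my choice for illustration rather than anything prescribed here: each sampled record is either found to be sensitive or not, and the estimate is the fraction of sensitive findings, softened by a neutral starting prior. The class name SensitivityEstimate, the source name "wiki" and the audit counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SensitivityEstimate:
    # Beta(alpha, beta) prior; 1 and 1 means "no opinion yet" (assumed starting point).
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, sensitive_found: int, clean_found: int) -> None:
        """Fold the results of one audit sample into the running estimate."""
        self.alpha += sensitive_found
        self.beta += clean_found

    @property
    def probability(self) -> float:
        """Current estimate that a randomly chosen record is sensitive."""
        return self.alpha / (self.alpha + self.beta)


wiki = SensitivityEstimate()
prior = wiki.probability                 # 0.5 before any evidence

# Hypothetical first audit: 50 records sampled, 3 found to be sensitive.
wiki.update(sensitive_found=3, clean_found=47)
after_first_audit = wiki.probability     # (1 + 3) / (2 + 50)

# Hypothetical second audit: 50 more records, none sensitive.
wiki.update(sensitive_found=0, clean_found=50)
after_second_audit = wiki.probability    # (1 + 3) / (2 + 100)
```

The point is not the particular formula but the habit: start with an honest guess, sample what you can, and let the estimate move as experience accumulates. A source whose estimate stays high earns tighter controls; one that keeps coming up clean can be loosened, with the understanding that the number is never final.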
Stay Active in Security
A natural conclusion might be that if all this is so, then from a security standpoint it’s better not to generate or use big data at all. One could have said the same thing about Wi-Fi a decade ago. But it doesn’t matter: people are doing it anyway.
And, amazingly, for all the security implications, it’s mostly working out, as we take the journey of discovering how secure is secure enough—with plenty of hard knocks along the way.
Yes, it’s true that we are in for some bumps in the road—but it’s always been an illusion to think that perfect security existed. It’s time to get over that illusion and face practical realities, and learn from them.
After all, a perfectly locked-down organization that is unable to take advantage of the next big opportunity will go out of business, just like one that doesn’t pay enough attention to securing its data, monitoring and adjusting.
Like a bicycle, what’s needed is the balance you get when moving forward.
The best way to predict the future is to make it happen. So stay alert and dynamic, remember that security is about an organization’s success, not the other way around, and take the journey.
And please include our SHARE session in your journey, where Brian Cummings will add important insights about the Big Data Value Chain and parallels to other environments such as research and development, on Friday, Aug. 8, from 10 to 11 a.m. in room 405.
Reg Harbeck has been working in IT and mainframes for more than 25 years, and is very involved in the mainframe culture and ecosystem, particularly with the SHARE Board and zNextGen and SECurity projects. He may be reached at Reg@Harbeck.ca.