Size doesn't matter

Oct 03, 2014

Don’t just ask what big data can do for you; ask what you can do for your data.

Why big data can actually mean big problems in information security, why we tend to get lost when we're "getting high" (mathematically), and why it is far better to have the right data than to have big data.

Last night I sat down with my friend Mark Grundland, who is a data scientist, mathematician, and frankly one of the smartest and kindest human beings I have ever encountered in my life. I showed Mark Alex Hutton's slide deck from his RVAsec talk in 2013 called “Towards A Modern Approach To Risk Management”, which he loved. When he reached the part of the deck where Alex was talking about “big data” and Hadoop, he tried to explain to me what the mathematical challenges of “big data” are. After a minute or so I stopped him, asked him for permission to record it, and here is a revised and enhanced transcript of his explanation:

Getting "high"

In any predictive analytics problem, it is far better to have the right data than to have simply more data. That is why big data is a problem and not a solution, at least not by itself, especially when big data essentially means that you are measuring the behaviour of a complex dynamic system, over time, according to a large number of dimensions.

While it may be fundamentally accurate and comprehensive to have as many measurements as possible, once the number of dimensions exceeds a few dozen it very quickly reaches a point where simply adding more dimensions (more types of distinct measurements to the dataset) makes the data more difficult to interpret without arbitrarily imposing some kind of summarization to present a seemingly clearer and more meaningful perspective. Why? Because in high dimensions all data is sparse. It no longer makes sense to derive a probability distribution because the space in which the observations are found is almost utterly empty.

In high dimensions, basic mathematical notions of angle and distance no longer work in the ways we normally take for granted, so our ability to draw mathematically meaningful conclusions based on them breaks down. Consider a hypersphere, the region of space less than a given distance from a point at its centre. As the number of dimensions increases, the volume of a hypersphere of fixed radius approaches zero. That is why in high dimensions it is rare to find any points that are genuinely close to each other. As density becomes difficult to measure, how can we use the frequency of observed occurrence to assess probabilities? As the number of dimensions grows, the number of observations required to estimate their probability distribution grows exponentially. In high dimensions, even big data turns out to be far too small. Probability means very little in empty space. And yes, of course, statistics can always tell us a bedtime story, but it will be a story more about all the assumptions we religiously uphold than about the data we actually observe, because in high dimensions there is often no way to connect the dots without assuming that we already know something of the structure of their pattern.
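To make the shrinking hypersphere concrete, here is a small sketch using the standard formula for the volume of a unit ball in n dimensions, pi^(n/2) / Gamma(n/2 + 1). The function names are mine, not Mark's, and the comparison with the enclosing cube is added only for illustration.

```python
import math

def unit_ball_volume(n: int) -> float:
    """Volume of the n-dimensional ball of radius 1."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def share_of_enclosing_cube(n: int) -> float:
    """Fraction of the cube [-1, 1]^n that lies inside the unit ball."""
    return unit_ball_volume(n) / 2.0 ** n

for n in (1, 2, 3, 5, 10, 20, 50, 100):
    print(f"n={n:3d}  volume={unit_ball_volume(n):.3e}  "
          f"share of cube={share_of_enclosing_cube(n):.3e}")
```

The volume rises until five dimensions and then collapses towards zero, while the share of the enclosing cube shrinks from the very start.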

Similarly, in high dimensions, many of the common mathematical tools that we have to describe whether two things are alike or different simply no longer apply. If we want to create meaningful measures of distance and of similarity between data points that are measured according to a great number of features, we have to recognize that within this high dimensional space in which we capture the measurements there is a much lower dimensional manifold, meaning a surface, a region, where most of the observations typically occur. Only in this low dimensional manifold can we actually bring similar things closer together and different things further apart.
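One way to see distance lose its meaning is to compare the nearest and the farthest neighbour of a query point as the number of dimensions grows. This is only a sketch under assumptions of my own: points drawn uniformly from a unit cube, Euclidean distance, and a hypothetical helper called distance_contrast.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of nearest to farthest Euclidean distance from one query point
    to n_points others, all drawn uniformly from the unit cube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

for n_dims in (2, 10, 100, 1000):
    ratio = distance_contrast(n_dims)
    print(f"{n_dims:5d} dimensions: nearest/farthest = {ratio:.3f}")
```

In two dimensions the nearest point is far closer than the farthest one. In a thousand dimensions the ratio creeps towards one, so "near" and "far" barely differ.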

When we measure a data point in more ways, what becomes ever more crucial is not “in how many ways can I measure a data point” but how you take all these ways and summarize these individual measurements into a form you can understand. Some of these summarizations can be made based on understanding the problem domain, which is often a preferable approach. Otherwise, these summarizations can, to a certain degree at least, be made by examining the distribution of the data points you have observed so far, doing a more or less sophisticated version of principal component analysis. What you want to do is to capture as much of the variability in as few measurements as possible. This would be very good if what you wanted to do were to understand the mean, that is, what typical observations are like.
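A bare-bones version of principal component analysis can be sketched with a singular value decomposition. The synthetic data below (two hidden factors spread across fifty noisy measurements) and the function name pca are assumptions made for the example, not anything from the conversation.

```python
import numpy as np

def pca(data, n_components):
    """Project centred data onto its top principal components.

    Returns the projected scores, the component directions, and the
    fraction of total variance each kept component explains.
    """
    centred = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    components = vt[:n_components]
    return centred @ components.T, components, explained[:n_components]

# Synthetic example: 2 hidden factors observed through 50 noisy measurements.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 50))
data = latent @ mixing + 0.05 * rng.normal(size=(500, 50))

scores, components, explained = pca(data, n_components=2)
print("shape after reduction:", scores.shape)
print("variance captured by 2 components:", round(float(explained.sum()), 4))
```

Here two summary measurements capture almost all of the variability, because the fifty measurements were built to lie near a two-dimensional surface. Real data is rarely this obliging.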

However, with the automated techniques of dimensionality reduction, meaning going from hundreds or thousands of dimensions to a low number of dimensions where distances, angles, and spatial relations are more meaningful again, the problem is that most of these techniques assume that you want to preserve relationships between typical data points. But in certain applications, such as anomaly detection, this is patently false: what you want to do is to discover which data points are not like the others. This is one of those things that are inherently difficult and poorly defined in the following sense: in high dimensions every data point is distinct. If you measure enough things, there will be some combination of measurements that distinguishes this event from all the others (unless it’s an exact clone).

In a low dimensional projection of a high dimensional set of measurements, what happens is that the very characteristics that may distinguish an unusual event are summarized out of existence and therefore are not taken into account: they occur so infrequently that they are not considered significant enough to be taken into consideration when summarizing the data. This is why anomaly detection in high dimensional spaces is so darn hard.
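The effect can be reproduced in a toy setting. In this sketch, which rests entirely on assumptions of mine, two of ten measurements vary a lot, the other eight barely vary, and one planted event is wildly unusual in one of the quiet measurements. A two-component principal component projection keeps the loud directions and throws the telltale one away.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two high-variance measurements and eight low-variance ones (assumed scales).
scales = np.array([10.0, 8.0] + [0.1] * 8)
data = rng.normal(size=(1000, 10)) * scales

# Planted anomaly: ordinary everywhere except one quiet measurement.
anomaly = np.zeros(10)
anomaly[5] = 2.0
data = np.vstack([data, anomaly])

# Keep only the top two principal components.
centred = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
projected = centred @ vt[:2].T

# How many standard deviations from the centre is the anomaly (last row)?
z_projected = np.abs(projected[-1]) / projected.std(axis=0)
z_original = np.abs(centred[-1]) / centred.std(axis=0)

print("largest z-score after projection:", round(float(z_projected.max()), 2))
print("largest z-score in original data:", round(float(z_original.max()), 2))
```

In the original measurements the planted event sits many standard deviations from the centre. After the projection it looks entirely ordinary.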

Projections

There are things you can at least try. Let us start with random projection, one of the very simplest things to do when confronting very high dimensional data whose structure is unknown. For the sake of simplicity, let us assume that all of the dimensions are continuous. This also works with discrete measurements, as for instance you can always represent the absence or presence of a feature by assigning a value of zero or one. First, you shift your distribution to be centred, so its centre of mass is at the origin. Next, you pick a random direction, meaning a random point on the surface of a unit sphere. For instance, take the random direction to be the unit vector parallel to a random point drawn from a standard normal distribution in the high dimensional space, whose coordinates can each be obtained by picking a value from a one-dimensional standard normal distribution. Next, you project the data onto the axis corresponding to this random direction. When you project the data onto that axis, you can readily obtain an accurate estimate of an empirical probability distribution for points on that axis, assigning a valid probability to any event you have observed and any new event you are going to observe. When you repeat this very simple procedure a great many times, you obtain a range of valid probability distributions representing the dataset from a range of random perspectives. Most importantly, you have not imposed any preferred perspective or assumed structure on the data.

Now, you wish to determine whether a point of interest is anomalous. For most of these random projections, the point of interest will appear somewhere in the middle of a normal distribution, as you would expect based on the central limit theorem. However, if you get one or more random projections where you find that the point of interest is located far in the tail of the distribution, meaning that it has a low probability, then this is strong evidence that something unusual is going on.

The nice thing about this very simple and widely applicable method is that, because the projected probability distributions can be represented by simple one-dimensional histograms, it is completely non-parametric and can be implemented to run very efficiently on huge datasets. A random projection is really one of the fastest things you can do with big data. “Is there any direction in which this point is anomalous?” is a pretty good way of asking “Is this an anomalous point?”. In fact, if there are many such directions you can be fairly certain that it is an anomalous point. Otherwise, you can confidently say, “Considering the 100 dimensional data set from 1000 different points of view, each one presenting an independently chosen random perspective on the data, no evidence for an anomaly was found.”
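The procedure Mark describes can be sketched in a few lines. This is my own reading of it, not his code: the function names, the synthetic data, and the 0.005 cut-off are all assumptions, and I use sorted projections (an empirical distribution by rank) in place of binned histograms because it is shorter to write.

```python
import numpy as np

def random_unit_directions(n_directions, n_dims, rng):
    """Random points on the unit sphere: normalised standard normal draws."""
    gaussian = rng.standard_normal((n_directions, n_dims))
    return gaussian / np.linalg.norm(gaussian, axis=1, keepdims=True)

def projected_tail_probabilities(data, points, n_directions=1000, seed=0):
    """For each point and each random direction, the empirical one-sided
    tail probability of the point's projection among the projected data."""
    rng = np.random.default_rng(seed)
    centre = data.mean(axis=0)  # shift the centre of mass to the origin
    directions = random_unit_directions(n_directions, data.shape[1], rng)
    projected_data = np.sort((data - centre) @ directions.T, axis=0)
    projected_points = (points - centre) @ directions.T
    n_obs = data.shape[0]
    tails = np.empty(projected_points.shape)
    for j in range(n_directions):
        column = projected_data[:, j]
        values = projected_points[:, j]
        at_or_below = np.searchsorted(column, values, side="right")
        at_or_above = n_obs - np.searchsorted(column, values, side="left")
        # Smaller tail, with +1 smoothing so a probability is never zero.
        tails[:, j] = (np.minimum(at_or_below, at_or_above) + 1) / (n_obs + 1)
    return tails

# Synthetic 100-dimensional data viewed from 1000 random perspectives.
rng = np.random.default_rng(42)
data = rng.standard_normal((2000, 100))
typical = rng.standard_normal(100)
anomaly = np.zeros(100)
anomaly[7] = 60.0  # extreme in a single measurement

tails = projected_tail_probabilities(data, np.vstack([typical, anomaly]))
flagged = (tails < 0.005).mean(axis=1)  # share of directions in the far tail
print("typical point flagged in", round(float(flagged[0]), 3), "of directions")
print("anomaly flagged in", round(float(flagged[1]), 3), "of directions")
```

With a thousand directions, even a typical point lands in a tail now and then by chance, so the sketch reports the share of directions that flag a point rather than reacting to a single hit. Note also that a point has to be far from the centre, relative to the overall spread of the data, before random directions start to notice it.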

Being ignored

There are of course many more sophisticated approaches one can try, but the simple fact of the matter remains unchanged: the more measurements you take, the harder it can be to reason about what it is that you are actually measuring. That is why it is really worthwhile, if you can, to measure the right things to begin with. The problem that we have is that when we try to reduce all of the dimensions we start to ignore the abnormalities, all the really weird and interesting stuff, you know, the stuff that would make us question our deepest and dearest assumptions. It is simply in the nature of the dimensionality reduction game that we tend to capture the typical relationships by ignoring the atypical ones. And that is perfectly fine, unless you are looking for the anomalies, the exceptions that break the rules.

Subsets

Alternatively, what you often tend to do is to introduce a more rule-based system. Rather than trying to consider all the dimensions at once, you break them down into subsets, and for each subset you reason “am I typical or not typical?”, and when you look at all the subsets you ask “are they all atypical? Is there a critical one which is atypical?” You take your measurements, you bundle them together, preferably in a domain dependent way, and then for each bundle you have an anomaly score, preferably based on a probability. Then you need a way of combining these scores to take into account the relationships between them. This is more like a fuzzy logic type of system that incorporates some domain knowledge. The intelligence is in how you choose to summarize all these measurements and combine them together to the point that you can reason about them. The projection needs to be based on some idea of what the measurements should mean in the domain, so that you’re measuring things you understand.
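A minimal sketch of the bundle idea might look like this. Everything specific in it is an assumption of mine: the bundle names ("network", "auth", "process"), the use of Mahalanobis distance to score a bundle, and the simple combining rule. In practice the bundles and the rule would come from knowledge of the domain.

```python
import numpy as np

def bundle_tail_probability(reference, point):
    """Empirical probability that a reference observation lies at least as
    far from the centre (in Mahalanobis distance) as the given point."""
    centre = reference.mean(axis=0)
    covariance = np.atleast_2d(np.cov(reference, rowvar=False))
    inverse = np.linalg.pinv(covariance)

    def squared_distance(x):
        delta = x - centre
        return np.einsum("...i,ij,...j->...", delta, inverse, delta)

    further = np.sum(squared_distance(reference) >= squared_distance(point))
    return (further + 1) / (len(reference) + 1)

def score_bundles(reference, point, bundles):
    """One probability-based anomaly score per named bundle of columns."""
    return {name: bundle_tail_probability(reference[:, cols], point[cols])
            for name, cols in bundles.items()}

def is_anomalous(scores, critical, strict=0.01, loose=0.1):
    """Assumed rule: any critical bundle is very atypical, or every bundle
    is at least mildly atypical."""
    critical_hit = any(scores[name] < strict for name in critical)
    all_atypical = all(p < loose for p in scores.values())
    return critical_hit or all_atypical

rng = np.random.default_rng(3)
reference = rng.standard_normal((2000, 6))
bundles = {"network": [0, 1], "auth": [2, 3], "process": [4, 5]}  # hypothetical
point = np.array([0.5, -0.5, 5.0, 5.0, 0.3, 0.2])  # odd only in "auth"

scores = score_bundles(reference, point, bundles)
print({name: round(float(p), 4) for name, p in scores.items()})
print("anomalous:", is_anomalous(scores, critical=["auth"]))
```

The scores stay interpretable because each one answers a question about a small group of measurements that someone understands. The combining rule is where the domain knowledge, and most of the judgement, goes.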

Breaking hierarchies

I will add one more point (my understanding):

When we talk about complex, dynamic systems, the difficulty is that anything that impacts the hierarchy disturbs our model and our ability to detect abnormalities. It can be changes in our processes or infrastructure, or the introduction of a new way of hacking: all of it can impact the hierarchy. We can develop an understanding (hierarchy) of what we know, but it is very hard to identify an abnormality when things are so fluid.

Final Words

So that's it. Size doesn't matter - it's what you do with whatever you have that matters. What a great lesson mathematics can teach us about information, about life, about love. Thank you Mark.

Blessed we are

Namaste

Sense of Awareness