Here is a quick riff on an analogy. Small data analysis is all about looking for needles in haystacks. Big data analysis is all about turning hay into needles (or rather, turning hay into something that achieves what we used needles to do).
To be more specific: small data analysis (i.e. the only form of data analysis we have had to date) was a reductive process – like everything else in a world where the data and information channels were likewise restrictive, largely as a result of their cost of deployment. Traditional marketing, for example, is the art of reduction – squeezing whole brand stories into 30-second segments in order to utilise the expensive distribution channel of TV. Academic analysis likewise – squeezing knowledge through the limited distribution vessel that is the academic or peer-reviewed publication.
As a result, the process of data analysis was all about discarding data that was not seen as sufficiently relevant or accurate, or reducing the amount of data analysed via sampling and statistical techniques. The conventional wisdom was that if you put poor quality data into a (small) data analysis box, you got poor quality results out at the other end. Sourcing small amounts of highly accurate and relevant data was the name of the game. All of scientific investigation has been based on this approach.
Not so with big data. We are just starting to realise that a funny thing happens to data when you can get enough of it and push it through analytical black boxes designed to handle quantity (algorithms). At a certain point, the volume of the data transcends the accuracy of its individual component parts in terms of producing a reliable result. It is a bit like a compass bearing (to shift analogies for a moment). A single bearing will produce a fix on something along one dimension. Take a second bearing and you can get a fix in two dimensions; take a third and you can get a fix in all three. However, any small inaccuracy in your measurements can produce a big inaccuracy in the resulting fix. Now suppose you have 10,000 bearings – or rather can produce a grid of 10,000 bearings, or a succession of overlapping grids, each comprised of millions of bearings. In this situation it is the density of the grid, the volume of the data and, interestingly, often the variance (i.e. the inaccuracies) within the data that is the prime determinant of your ability to get an accurate fix.
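The statistical effect behind the bearings analogy is the law of large numbers: average enough individually sloppy readings and the errors cancel. A minimal sketch, assuming a made-up true bearing and an invented 5-degree measurement error (none of these numbers come from the text – they are purely illustrative):

```python
import random

random.seed(0)

TRUE_BEARING = 137.0  # degrees; the value we are trying to fix (invented for the example)
SIGMA = 5.0           # each individual reading is off by ~5 degrees on average

def noisy_bearing():
    # one cheap, inaccurate measurement ("hay", not a "needle")
    return TRUE_BEARING + random.gauss(0, SIGMA)

# With one reading you are at the mercy of its error;
# with thousands, the average homes in on the true value.
for n in (1, 100, 10_000):
    readings = [noisy_bearing() for _ in range(n)]
    estimate = sum(readings) / n
    print(f"{n:>6} readings -> error of {abs(estimate - TRUE_BEARING):.3f} degrees")
```

The expected error of the average shrinks roughly as SIGMA divided by the square root of the sample size, which is why volume, not per-reading accuracy, ends up determining the quality of the fix.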
To return to haystacks: it is the hay itself which becomes important – and rather than looking for needles within it, it is a bit like looking into a haystack and finding an already stitched-together suit of clothes. This is why big data is such an important thing – and also why a big data approach is fundamentally different to what we can now call small data analysis. It is also why there is now no such thing as inconsequential information (i.e. hay) – every bit of it now has a use, provided you can capture it and run it through an appropriate tailoring algorithm.