The "Big Data" Revolution? Really?

One of the great ironies of the “big data” revolution is that so much of the insight we draw from these massive datasets actually comes from small samples no larger than the datasets we have always used. A social media analysis might begin with a trillion tweets, use a keyword search to reduce that number to a hundred million tweets, and then use a random sample of just 1,000 tweets to generate the final result presented to the user. As our datasets grow ever larger, the algorithms and computing environments we use to analyze them have not grown accordingly, leaving our results less and less representative even as we have more and more data at our fingertips. What does this mean for the future of “big data?”
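The statistical ceiling of such a sample can be made concrete. The sketch below is illustrative, not taken from any real pipeline: it applies the standard worst-case margin-of-error formula for a proportion at 95% confidence to a few assumed sample sizes.

```python
import math

# Worst-case 95% margin of error for a proportion estimated from
# n sampled items (p = 0.5 maximizes the standard error).
def margin_of_error(n, z=1.96, p=0.5):
    return z * math.sqrt(p * (1 - p) / n)

# Whether the archive holds a trillion tweets or a million, a
# 1,000-item sample carries the same uncertainty.
for n in (1_000, 10_000, 1_000_000):
    print(f"n = {n:>9,}: ±{margin_of_error(n):.2%}")
```

A 1,000-tweet sample cannot resolve anything finer than about ±3 percentage points, no matter how large the archive it was drawn from.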

Stepping back from all of the hype and hyperbole, there is considerable truth to the statement that we live in an era in which data is valued sufficiently that we believe it worth the time and expense to collect, store and analyze it at scales that significantly exceed those of the past.

It is just as true that many of our beliefs regarding the size of the datasets we use are simply wrong. In particular, many of the standard-bearers of the big data era that we hold up as benchmarks of what it means to work with “big data,” like Facebook and Twitter, are actually vastly smaller than we have been led to believe.

In many ways, much of the size of the “big data” revolution exists only in our imaginations, aided by the reality distortion field of the big web companies that tout their enormous scale without actually releasing the hard numbers that might lead us to call those claims into question.

Most troublesome of all, however, is the way in which we analyze the data we have.

Computing power continues to increase, and in today’s world of GPUs, TPUs, FPGAs and all manner of other accelerators, we have no shortage of hardware on which to run our analyses. The problem is that despite nearly unfathomable amounts of hardware humming away in the data centers of the big cloud companies, we are still just as constrained as we have always been.

We may have vast amounts of hardware, and the datasets we wish to analyze are larger still, but we aren’t actually lacking for the hardware to process petabytes or even tens or hundreds of petabytes. The bottleneck lies elsewhere.

The problem is that when it comes to big data analyses, there seems to be a tremendous gulf between the companies performing population-scale analyses, using tools like BigQuery to analyze the totality of their datasets and return exact results, and the rest of the “big data” world, in which estimations and random samples dominate.

Estimations are particularly prevalent in spaces like social media and behavioral analysis.

In a world in which all it takes is a single line of SQL to analyze tens of petabytes with absolute accuracy, why is estimation so popular?

Part of the answer is our preoccupation with speed over accuracy. Why wait minutes for a petascale analysis when you can wait seconds for a random sample that may or may not bear even the slightest resemblance to reality?
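That trade can be illustrated with synthetic data (the 12% prevalence and sample count below are assumptions, not any real platform’s figures): the exact, full-scan answer is a single fixed number, while repeated 1,000-item samples each come back fast but scatter around it.

```python
import random

random.seed(42)

# Synthetic topic with a known true prevalence of 12%.
true_rate = 0.12
print(f"exact (full-scan) rate: {true_rate:.1%}")

# Repeated 1,000-item random samples: each is quick to compute,
# but each returns a different estimate of the same quantity.
for trial in range(5):
    hits = sum(random.random() < true_rate for _ in range(1_000))
    print(f"sample {trial + 1}: {hits / 1_000:.1%}")
```

Every sampled run is “an answer,” but no two runs agree, and nothing in the output tells the reader which one to trust.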

In turn, our ability to tolerate immense error in our “big data” results often comes from the fact that the consequences for bad results are minimal in many “big data” domains. An analysis of the most influential Twitter users by month for a given topic over the last few years doesn’t really have a “right” answer against which an estimation can be compared. Moreover, even if the results are entirely wrong, the consequences are minimal.

Part of the answer is that our more complex algorithms have not kept pace with the size of the datasets we work with today. Many of the analytic algorithms of greatest interest to data scientists were born in the era of small data and have yet to be modernized for the size of data we wish to apply them to today. Few mapping platforms can perform spatial clustering on billions of points, while graph layout algorithms struggle to scale beyond millions of edges.
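A rough back-of-the-envelope sketch shows why. A classic force-directed layout compares every pair of nodes on each iteration, roughly n² operations before Barnes–Hut-style approximations; the node counts and the assumed 10⁹ operations per second below are illustrative assumptions only.

```python
# Cost of one iteration of an all-pairs force-directed layout,
# assuming an optimistic 1 billion pairwise operations per second.
OPS_PER_SECOND = 1e9

for nodes in (10_000, 1_000_000, 100_000_000):
    ops_per_iteration = nodes ** 2
    seconds = ops_per_iteration / OPS_PER_SECOND
    print(f"{nodes:>11,} nodes: {seconds:,.1f} s per iteration")
```

Quadratic growth takes a tenth of a second at ten thousand nodes to months per iteration at a hundred million, long before reaching trillion-edge graphs.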

In short, the volume of data available to us has outpaced the ability of our algorithms to make sense of it. In the past we had low volumes of very rich data. Today we have high volumes of very poor data, meaning we have to process much more data to achieve the same results.

When our algorithms aren’t scalable, we are left processing the same amount of data as before, but that sample is less and less representative of the whole.

It isn’t just a question of speed. Many of our most heavily used algorithms were never designed to work with large datasets, even when speed is not a concern. Few graph layout algorithms can do much to extract the structure of a dense trillion-edge graph, even given limitless computing power and as much time as they need to complete.

In short, we are confronted with the paradox that the more data we have, the less representative our findings are due to the need for sampling.

Platforms like BigQuery are slowly changing the algorithmic side of that equation. As they move beyond their reporting roots towards providing high level analyses like geospatial analytics, they are beginning to externalize the kind of massive algorithmic scalability that Google uses internally to bring more and more algorithms into the scalable cloud era. As these trends progress, algorithmic scalability will steadily become less of a limiting factor for common use cases.

Perhaps the biggest issue is that we need to stop treating “big data” as a marketing gimmick.

As companies have begun to market themselves in terms of the size of the datasets they hold and analyze, we’ve let go of the idea of actually understanding anything about the data and algorithms we’re using.

Even the most rigorous data scientists freely accept the idea of reporting trends from Twitter without having any idea what the full trillion-tweet archive they are working from actually looks like. In fact, few data scientists who work with Twitter even know that just over one trillion tweets have ever been sent.

Understanding your denominator used to be sacrosanct in data science. Somehow, we’ve reached a point where we simply accept findings from analyses in which we no longer have a denominator at all.

Once we start treating “big data” analysis as a methodologically, algorithmically and statistically rigorous process based on well understood data, we recognize the need for population-scale analyses with guaranteed correctness. We turn to platforms like BigQuery to run our analyses and focus on accuracy and completeness. We modernize the algorithms we need. We relentlessly test our results.

Putting this all together, it is ironic that in a world drowning in data, we have increasingly turned to sampling to render our “big data” small again. As our datasets have grown in size, we have sampled them ever more aggressively to keep the actual amount of data fed to our analyses and algorithms roughly the same. Whereas in years past our analyses might have been based on the totality of a dataset, today that same analysis might consider just one ten thousandth of one percent of that dataset. As we have ever more data, our results are becoming ever less representative.
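The collapse described above is simple arithmetic. Holding the sample fixed at 1,000 items while the dataset grows (the dataset sizes below are illustrative assumptions), the fraction of the data an analysis actually sees shrinks toward nothing:

```python
# A fixed 1,000-item sample against datasets of growing size:
# the share of the data the analysis actually touches collapses
# even as the dataset itself "grows".
SAMPLE_SIZE = 1_000

for dataset_size in (100_000, 100_000_000, 1_000_000_000_000):
    fraction = SAMPLE_SIZE / dataset_size
    print(f"{dataset_size:>17,} records -> sample covers {fraction:.10%}")
```

The same 1,000 items represent 1% of a hundred-thousand-record dataset but only a ten-millionth of a percent of a trillion-record one.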

In short, our so-called “big data” revolution has actually resulted in less understanding of our world, through less representative data, than ever before.

In the end, we need to stop treating “big data” as a marketing slogan and bring ourselves back to the era where we actually cared about the quality of our data analyses before we undermine the public's trust in the power of data.
