Data: Big, Really Big, Damn Big

“Big” is one of those words that drive detail-oriented people crazy.  Big is a relative term, and it reminds me of a classic scene from Crocodile Dundee in which Dundee and Sue are approached by a young man attempting to rob them.  The young robber pulls out a knife, probably one that is big to him.  This prompts Dundee to pull out his own, substantially larger knife and say, “That’s not a knife. THIS is a knife.”  It is all relative.

I offer this definition for Big Data, one I have used to consider the associated impacts: datasets whose size is beyond the ability of typical software tools to capture, store, manage, and analyze within a tolerable elapsed time.  This is a good definition, but it is also relative to one’s perspective. For example, in the early 1800s, Thomas Jefferson had acquired the largest personal collection of books in the United States. He sold his library to Congress as a replacement for the collection destroyed by the British during the War of 1812; Congress purchased it for $23,950 in 1815. At 6,487 volumes, it was a big dataset for its day, one that was stored in an entire building.  By today’s standards, it is hardly big.

A few years ago, while I was at the National Archives, I did an analysis to estimate what the digital footprint of the analog record collection stored there would be.  This was important because we were looking to size electronic systems to support these records once they were converted to a digital format. At the time, these records were characterized as:

  • 12 billion pages
  • 18 million maps
  • 50 million photos
  • 550 thousand artifacts
  • 360 thousand films
  • Other records (audio, etc.)

The digital preservation community has developed standards for converting analog materials into a digital format while minimizing the losses that result from the conversion process.  Using these standards, and the number and type of records in the holdings, the estimated digital size was approximately 3 Exabytes (3 billion Gigabytes). Big.  When you realize that the Archives receives only 2-3% of the data created within the government, this means the government has created over 100 Exabytes.  Additionally, the gap between when records are created and when the Archives receives them is sometimes 2-3 decades, meaning that there is a significant pipeline of record data on its way to the Archives.
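To make the arithmetic concrete, here is a minimal sketch of how such an estimate comes together. The per-item scan sizes below are my own illustrative assumptions, not the preservation standards we actually used, so treat the result as order-of-magnitude only.

```python
# Back-of-the-envelope estimate of the Archives' digital footprint.
# NOTE: the per-item sizes are illustrative assumptions, not the official
# digitization standards referenced above.

holdings = {
    # item type: (count, assumed size per digitized item, in Gigabytes)
    "pages":     (12_000_000_000, 0.05),     # high-resolution page scan, ~50 MB
    "maps":      (18_000_000,     2.0),      # large-format, high-DPI scan
    "photos":    (50_000_000,     0.2),      # archival-quality TIFF
    "artifacts": (550_000,        2.0),      # multi-angle photography
    "films":     (360_000,        5_000.0),  # high-resolution film scan, ~5 TB
}

total_gb = sum(count * size_gb for count, size_gb in holdings.values())
total_eb = total_gb / 1_000_000_000           # 1 Exabyte = 1 billion Gigabytes
print(f"Estimated digital footprint: ~{total_eb:.1f} EB")

# If the Archives receives only 2-3% of what the government creates,
# the government-wide total is roughly 30-50 times larger.
print(f"Implied government-wide total: ~{total_eb / 0.03:.0f} to {total_eb / 0.02:.0f} EB")
```

Even with these rough assumptions, the holdings land in the low-Exabyte range, and the 2-3% share implies a government-wide total on the order of 100 Exabytes.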

IDC projects that data is growing at a compounded rate of greater than 32% per year, and some projections put the growth rate closer to 40% per year.  Using 32% annual growth, the record data coming to the Archives will approach 14 Zettabytes (14 trillion Gigabytes) within about two decades.  Really Big.
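For readers who want to see where that figure comes from, here is a minimal sketch of the compounding, starting from the roughly 100 Exabytes above and the 32% rate; the specific horizons are my own choices for illustration.

```python
# Compound growth of the government-wide data volume.
# The ~100 EB starting point and 32%/year rate come from the article;
# the horizons below are illustrative choices.

start_eb = 100        # approximate government-wide total today, in Exabytes
growth = 0.32         # compounded annual growth rate (IDC projection)

for years in (10, 18, 25):
    future_eb = start_eb * (1 + growth) ** years
    print(f"After {years} years: ~{future_eb:,.0f} EB (~{future_eb / 1_000:.1f} ZB)")

# At roughly 18 years of 32% growth the total passes the 14 Zettabyte
# (14 trillion Gigabyte) mark cited above.
```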

Clearly, we are awash in data and will continue to be.  Dealing with these data requires improved analytical and reporting methods.  Some things to consider:

  • Enhance analytics to deliver more timely and accurate results on structured and unstructured data
  • Enhance analytics for data consisting of images, sound, and motion
  • Develop near-immediate, high-availability publishing capabilities to automatically supply relevant information

Yes, data growth is now accelerating to double every year.

How true. I read somewhere that 98% of the world's data was collected in the last 3-4 years. It is not just that we are seeing a huge number of enrollments under the Affordable Care Act (ACA); the various components of the system are also producing large amounts of data that CMS wants to analyze.

So true Mike. Nice article.
