Data: Big, Really Big, Damn Big

“Big” is one of those words that drive detail-oriented people crazy.  Big is a relative term, and it reminds me of a classic scene from Crocodile Dundee in which Dundee and Sue are approached by a young man attempting to rob them.  The young robber pulls out a knife, probably one that is big to him.  This prompts Dundee to pull out his own, substantially larger knife and say, “That’s not a knife. THIS is a knife.”  It is all relative.

I offer this definition for Big Data, one I have used to consider the associated impacts: datasets whose size is beyond the ability of typical software tools to capture, store, manage, and analyze within a tolerable elapsed time.  This is a good definition, but it is also relative to one’s perspective. For example, in the early 1800s, Thomas Jefferson had acquired the largest personal collection of books in the United States. He sold his library to Congress as a replacement for the collection destroyed by the British during the War of 1812; Congress purchased it for $23,950 in 1815. At 6,487 volumes, it was a big dataset for its day, one that was stored in an entire building.  By today’s standards, it is hardly big.

A few years ago, while I was at the National Archives, I did an analysis to estimate what the digital footprint of the analog record collection stored there would be.  This was important because we were looking to size electronic systems to support these records once they were converted to a digital format. At the time, these records were characterized as:

  • 12 billion pages
  • 18 million maps
  • 50 million photos
  • 550 thousand artifacts
  • 360 thousand films
  • Other records (audio, etc.)

The digital preservation community has developed standards for converting analog materials into a digital format while minimizing the losses that result from the conversion process.  Using these standards, and the number and type of records in the holdings, the estimated digital size was approximately 3 Exabytes (3 billion Gigabytes). Big.  When you realize that the Archives receives only 2-3% of the data created within the government, this means the government has created over 100 Exabytes.  Additionally, the gap between when records are created and when the Archives receives them is sometimes 2-3 decades, meaning that there is a significant pipeline of record data on its way to the Archives.
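To make the arithmetic concrete, here is a minimal sketch of how such an estimate comes together. The per-item scan sizes below are my own illustrative assumptions, not the preservation standards we actually used, so treat the result as order-of-magnitude only.

```python
# Back-of-the-envelope estimate of the Archives' digital footprint.
# NOTE: the per-item sizes are illustrative assumptions, not the official
# digitization standards referenced above.

holdings = {
    # item type: (count, assumed size per digitized item, in Gigabytes)
    "pages":     (12_000_000_000, 0.05),     # high-resolution page scan, ~50 MB
    "maps":      (18_000_000,     2.0),      # large-format, high-DPI scan
    "photos":    (50_000_000,     0.2),      # archival-quality TIFF
    "artifacts": (550_000,        2.0),      # multi-angle photography
    "films":     (360_000,        5_000.0),  # high-resolution film scan, ~5 TB
}

total_gb = sum(count * size_gb for count, size_gb in holdings.values())
total_eb = total_gb / 1_000_000_000           # 1 Exabyte = 1 billion Gigabytes
print(f"Estimated digital footprint: ~{total_eb:.1f} EB")

# If the Archives receives only 2-3% of what the government creates,
# the government-wide total is roughly 30-50 times larger.
print(f"Implied government-wide total: ~{total_eb / 0.03:.0f} to {total_eb / 0.02:.0f} EB")
```

Even with these rough assumptions, the holdings land in the low-Exabyte range, and the 2-3% share implies a government-wide total on the order of 100 Exabytes.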

IDC projects that data is growing at a compounded rate of greater than 32% per year, and some projections put the growth rate closer to 40% per year.  Using 32% annual growth, the record data coming to the Archives will approach 14 Zettabytes (14 trillion Gigabytes) within about two decades.  Really Big.
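For readers who want to see where that figure comes from, here is a minimal sketch of the compounding, starting from the roughly 100 Exabytes above and the 32% rate; the specific horizons are my own choices for illustration.

```python
# Compound growth of the government-wide data volume.
# The ~100 EB starting point and 32%/year rate come from the article;
# the horizons below are illustrative choices.

start_eb = 100        # approximate government-wide total today, in Exabytes
growth = 0.32         # compounded annual growth rate (IDC projection)

for years in (10, 18, 25):
    future_eb = start_eb * (1 + growth) ** years
    print(f"After {years} years: ~{future_eb:,.0f} EB (~{future_eb / 1_000:.1f} ZB)")

# At roughly 18 years of 32% growth the total passes the 14 Zettabyte
# (14 trillion Gigabyte) mark cited above.
```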

Clearly, we are awash in data and will continue to be.  Dealing with these data requires improved analytical and reporting methods.  Some things to consider:

  • Enhance analytics to deliver more timely and accurate results on structured and unstructured data
  • Enhance analytics for data consisting of images, sound, and motion
  • Develop near-immediate, high-availability publishing capabilities to automatically supply relevant information

Yes, data growth is now accelerating to double every year.

How true. I read somewhere that 98% of the world's data was collected in the last 3-4 years. It is not just that we are seeing a huge number of enrollments under the Affordable Care Act (ACA); the various components of the system are also producing large amounts of data that CMS wants to analyze.

So true Mike. Nice article.
