Data Volume - Simple Conversation

Volume:

I'm not going to go into a lot of detail here, since this is not the forum for a white paper on the topic. However, I would like to surface some personal thoughts in hopes of sparking a conversation.

How much data are we talking about?

The volume aspect of data is simply a measurement of how many mega-, giga-, tera-, peta-, or zettabytes comprise the data artifact(s) in total. Volume can measure new data being acquired via a feed process, the total footprint of existing data already within a system, or an entire ecosystem. For example, a new data feed might be 100GB, the footprint of the system it lands in 4TB, and the surrounding ecosystem 2PB.

Focusing on the storage aspect of data: with volume comes cost. Even with the mantra that "storage is cheap", and with compression algorithms shrinking the storage footprint relative to the raw data size, storing data in large amounts still costs real dollars. With the increase in deployments of systems that leverage IoT, telemetry, logging, video, imaging, and other large data services, global data is gathered in fantastic volumes every second.

To illustrate the expected growth in data, I found this chart provided by Statista:

[Chart: Total data volume worldwide 2010-2025 | Statista]

The key point of the chart is that by 2025, "global data creation is projected to grow to more than 180 zettabytes". Reviewing the information behind the chart, the projected average annual data growth rate holds at around ~28% from 2015 to 2025. Starting at 2020 (64.2ZB) and projecting through 2025 (181ZB), the amount of data being created would nearly triple in just five short years. Even at these incredible growth projections, I believe the totals are still on the low side.
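To sanity-check that arithmetic, here is a quick back-of-envelope calculation (a minimal sketch in Python, using only the two Statista data points cited above; note the ~28% figure in the paragraph spans the longer 2015-2025 window):

```python
# Quick check of the growth figures above, using the Statista data
# points cited in this article (2020: 64.2 ZB, 2025: 181 ZB).
start_zb, end_zb = 64.2, 181.0      # zettabytes of new data created per year
years = 2025 - 2020

growth_multiple = end_zb / start_zb          # ~2.8x over five years
cagr = growth_multiple ** (1 / years) - 1    # ~23%/yr for the 2020-2025 span

print(f"Growth multiple 2020-2025: {growth_multiple:.2f}x")   # 2.82x
print(f"Implied annual growth rate: {cagr:.1%}")              # 23.0%
```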

Using the numbers provided by Statista for the year 2020, the potential storage cost of just the new data created would be ~$1B, assuming all of it were stored somewhere. As more and more companies use data in more interesting ways to feed their data-driven insights, the amount of data retained over multiple years will only increase.
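For the curious, that estimate works out roughly as follows. This is a hedged sketch: the unit price below is simply backed out of the ~$1B figure above and is illustrative, not a quoted market rate, since real prices vary by orders of magnitude across tape, disk, and replicated cloud tiers.

```python
# Illustrative only: total storage cost scales linearly with the assumed
# price per terabyte, and real prices vary by orders of magnitude across
# tape, disk, and replicated cloud object storage.
ZB_TO_TB = 1_000_000_000      # 1 zettabyte = 10^9 terabytes

new_data_zb = 64.2            # Statista figure: new data created in 2020
usd_per_tb = 0.015            # hypothetical blended rate backed out of the
                              # ~$1B estimate above; swap in your own number

total_cost = new_data_zb * ZB_TO_TB * usd_per_tb
print(f"Storing {new_data_zb} ZB at ${usd_per_tb}/TB ~ ${total_cost:,.0f}")
```

Whatever unit price you plug in, the takeaway is the same: at zettabyte scale, even tiny per-terabyte costs add up to real money.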

Unlike a system's transitory processing layer, where memory and CPU are the focal points, the data layer persists over time. Data continues to be created, gathered, analyzed, visualized, and, most importantly, stored; yet this lifeblood of business still goes mostly unmanaged. As data is stored, for various reasons, the new layers acquired each day are added to the existing volumes, continuing the growth. Worse, each layer of data tends to become disconnected from the older ones, even as they are piled on top of each other over time.

Even as data is stored in an ecosystem, adding to the bloat of the data environment, the metadata that would provide context, ownership, purpose, and/or definition over time is mostly absent. As annual worldwide data creation grows toward that monstrous 181+ zettabytes, how much of it will be stored and provide real value, and how much will be stored just in case? How much of the data will be usable even after six months, let alone a year or more, and how much will simply be lost because it is no longer understood? How much will be combined in incorrect ways because knowledge of its lineage walked out the door with a departing employee? With this massive investment in data volume, where is the investment in persisted metadata that would give it clear value over time?

I know I started the conversation with a focus on data volume, but the point I'd like to leave you with is the need for governance and metadata that allow for proper data management in this massive data ecosystem. Data is a persisted resource, and as volumes increase there is a real need to clearly identify the data, its properties, and its characteristics so that it can be properly managed and secured through automation, potentially saving businesses millions of dollars in storage costs annually. That is the metadata which would allow businesses to use and govern their data beyond just today, or this month.
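To make the idea concrete, here is a minimal sketch of what a persisted metadata record might look like. The field names are my own illustrative assumptions, not any particular catalog product's schema:

```python
# A minimal sketch of the persisted metadata this article argues for.
# Field names are illustrative assumptions, not any vendor's schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetMetadata:
    name: str                 # stable identifier for the data artifact
    owner: str                # accountable team or person
    purpose: str              # why the data is being retained
    lineage: list[str]        # upstream sources it was derived from
    classification: str       # e.g. "public", "internal", "restricted"
    retention_until: date     # when automation may archive or delete it

feed = DatasetMetadata(
    name="telemetry_feed_v2",
    owner="platform-data-team",
    purpose="device health dashboards",
    lineage=["raw_device_events"],
    classification="internal",
    retention_until=date(2026, 1, 1),
)
print(feed)
```

With records like this persisted alongside the data itself, retention and access policies become something automation can enforce rather than tribal knowledge.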
