How BIG is your data?
By Ronald P. Reck, April 19, 2017
Fast data is the new big data. Until now, the volume of data has been the big challenge for those trying to understand what computers are doing and what is really going on behind the scenes. Every day, data warriors employ strategies like Hadoop and Spark to tackle these problems the best way they know how.
Sadly, this is based on an aging paradigm. Everyone has a different definition for the term BIG DATA. The real definition is, in essence, enough data to be overwhelming. One thing to know is that the sheer number of computers and devices connected to networks has surpassed the number of people on the planet. Combine that with the fact that new sources of data are growing faster than the population. This means the internet is increasingly about nodes communicating with each other and less about people doing so.
What does this mean?
It means that the characterization of BIG DATA will change and that streaming analysis will be the only really viable strategy for dealing with the increased velocity of new data. The old paradigm of aggregating data and then analyzing it will be recognized for what it is: quaint and antiquated. Think "incandescent bulb".
We will still have the volume problem, but once we recognize why we really have it, we will be forced to modify what is necessarily a failing strategy. Do you wonder what this new strategy looks like?
DECISIONS NOT DATA
It is likely that making data available through an API or for download will be replaced by making an analysis, synthesis, or decision available. Think about credit scores: when someone is making a credit decision, they don't get your entire credit history, merely a single number on which to base their decision. The future will hold more situations similar to biological systems, where information from sensory organs is reduced for the purpose at hand. We will hand off decisions, not data.
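As a rough illustration of "decisions, not data", here is a minimal Python sketch: the raw history never leaves the provider, and only a distilled score and decision are handed to the caller. The names (CreditDecision, decide, the scoring rule, the threshold) are hypothetical and only illustrate the pattern, not any real scoring system.

```python
from dataclasses import dataclass

@dataclass
class CreditDecision:
    score: int      # the single number the consumer sees
    approved: bool  # the decision, not the underlying data

def decide(history: list[dict], threshold: int = 650) -> CreditDecision:
    """Reduce a (hypothetical) credit history to a decision.

    The raw records stay inside this function; only the
    synthesized result is returned to the caller.
    """
    # Toy scoring rule: start from a base and penalize late payments.
    score = 700 - 30 * sum(1 for event in history if event.get("late"))
    return CreditDecision(score=score, approved=score >= threshold)

if __name__ == "__main__":
    history = [{"late": False}, {"late": True}, {"late": False}]
    print(decide(history))  # CreditDecision(score=670, approved=True)
```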
QUALITY
Anyone experienced with data analysis will agree that analysis is largely data preparation, not interpretation. Just like "painting" is really about taping, edging, drop cloths, and supplies, data analysis is really about strategies for compensating for poor data quality. This can involve any of the six dimensions: completeness, conformity, consistency, accuracy, duplication, and integrity. https://www.whitepapers.em360tech.com/wp-content/files_mf/1407250286DAMAUKDQDimensionsWhitePaperR37.pdf
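To make a few of those dimensions concrete, here is a minimal sketch of what checks for completeness, conformity, accuracy, and duplication might look like on simple tabular records. The field names and rules are hypothetical, chosen only to show the shape of the work, not a general-purpose validator.

```python
import re

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email",  "age": -5},
    {"id": 2, "email": "b@example.com", "age": None},  # duplicate id, missing age
]

def completeness(rec):   # every expected field has a value
    return all(rec.get(k) is not None for k in ("id", "email", "age"))

def conformity(rec):     # values match the expected format
    return bool(re.match(r"[^@]+@[^@]+\.[^@]+$", rec["email"] or ""))

def accuracy(rec):       # values fall in a plausible range
    return rec["age"] is not None and 0 <= rec["age"] <= 120

def duplication(recs):   # ids should be unique across the set
    ids = [r["id"] for r in recs]
    return len(ids) == len(set(ids))

for rec in records:
    print(rec["id"], completeness(rec), conformity(rec), accuracy(rec))
print("no duplicates:", duplication(records))
```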
In the end, things cannot continue as they are now.
In the future, limitations in data quality will be increasingly intolerable, and data's suitability will be evaluated based on its quality. Data providers will necessarily perform the validation tasks, because consumers simply will not be able to do it fast enough. Quality will be the hallmark by which we judge data sources, not the afterthought it is now.
TRUTH VALUE
Anyone who has dealt with large enough datasets, especially from different sources, knows that conflicts in information are inevitable. Most data models are flat and treat all data as ground truth. Everything is modeled as being known equally well, even though that is not accurate. For example, people agree that newer information is more valuable than old information (probably based on the open world assumption), but they don't encode the data that way. Eventually, we will be forced to keep track of how well we know something in order to make probability-based decisions. Veracity, otherwise known as truth value, will be based on validation.
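As a sketch of what tracking "how well we know something" could look like, the snippet below resolves conflicting assertions by weighting each one with an exponential decay on its age and a per-source confidence, then picks the value with the highest combined weight. The half-life, confidences, and sample data are all hypothetical assumptions made for illustration.

```python
import math
from collections import defaultdict

# Conflicting assertions from different sources:
# (value, age_in_days, source_confidence)
assertions = [
    ("address A", 400, 0.9),  # older, but from a trusted source
    ("address B",  10, 0.6),  # recent, but from a weaker source
    ("address B",  30, 0.5),
]

def resolve(assertions, half_life_days=180.0):
    """Weight each assertion by recency and source confidence,
    then return the best value and a rough belief score."""
    weights = defaultdict(float)
    for value, age, confidence in assertions:
        recency = math.exp(-math.log(2) * age / half_life_days)
        weights[value] += confidence * recency
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

print(resolve(assertions))  # ('address B', ~0.84)
```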