How BIG is your data?
By Ronald P. Reck, April 19, 2017
Fast data is the new big data. Until now, the volume of data has been the big challenge for those trying to understand what computers are doing and what is really going on behind the scenes. Every day, data warriors employ strategies like Hadoop and Spark to tackle these problems the best way they know how.
Sadly, this is based on an aging paradigm. Everyone has a different definition for the term BIG DATA. The real definition is, in essence, enough data to be overwhelming. One thing to know is that the sheer number of computers and devices connected to networks has surpassed the number of people on the planet. Combine that with the fact that new sources of data are growing faster than the population. This means the internet is increasingly about nodes communicating with each other and less about people doing so.
What does this mean?
It means that the characterization of BIG DATA will change and that streaming analysis will be the only really viable strategy for dealing with the increased velocity of new data. The old paradigm of aggregating data and then analyzing it will be recognized for what it is: quaint and antiquated. Think "incandescent bulb".
We will still have the volume problem, but once we recognize why we really have it, we will be forced to modify what is necessarily a failing strategy. Do you wonder what this new strategy looks like?
DECISIONS NOT DATA
It is likely that making data available through an API or for download will be replaced by making an analysis, synthesis, or decision available. Think about credit scores: when someone is making a credit decision, they don't get your entire credit history, merely a single number on which to base their decision. The future will hold more situations similar to biological systems, where information from sensory organs is reduced for the purpose at hand. We will hand off decisions, not data.
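As a rough illustration of "decisions, not data", here is a minimal Python sketch: the raw history never leaves the provider, and only a distilled score and decision are handed to the caller. The names (CreditDecision, decide, the scoring rule, the threshold) are hypothetical and only illustrate the pattern, not any real scoring system.

```python
from dataclasses import dataclass

@dataclass
class CreditDecision:
    score: int      # the single number the consumer sees
    approved: bool  # the decision, not the underlying data

def decide(history: list[dict], threshold: int = 650) -> CreditDecision:
    """Reduce a (hypothetical) credit history to a decision.

    The raw records stay inside this function; only the
    synthesized result is returned to the caller.
    """
    # Toy scoring rule: start from a base and penalize late payments.
    score = 700 - 30 * sum(1 for event in history if event.get("late"))
    return CreditDecision(score=score, approved=score >= threshold)

if __name__ == "__main__":
    history = [{"late": False}, {"late": True}, {"late": False}]
    print(decide(history))  # CreditDecision(score=670, approved=True)
```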
QUALITY
Anyone experienced with data analysis will agree that analysis is largely data preparation, not interpretation. Just like "painting" is really about taping, edging, drop cloths, and supplies, data analysis is really about strategies for compensating for poor data quality. This can involve any of the six dimensions: completeness, conformity, consistency, accuracy, duplication, and integrity. https://www.whitepapers.em360tech.com/wp-content/files_mf/1407250286DAMAUKDQDimensionsWhitePaperR37.pdf
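To make a few of those dimensions concrete, here is a minimal sketch of what checks for completeness, conformity, accuracy, and duplication might look like on simple tabular records. The field names and rules are hypothetical, chosen only to show the shape of the work, not a general-purpose validator.

```python
import re

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email",  "age": -5},
    {"id": 2, "email": "b@example.com", "age": None},  # duplicate id, missing age
]

def completeness(rec):   # every expected field has a value
    return all(rec.get(k) is not None for k in ("id", "email", "age"))

def conformity(rec):     # values match the expected format
    return bool(re.match(r"[^@]+@[^@]+\.[^@]+$", rec["email"] or ""))

def accuracy(rec):       # values fall in a plausible range
    return rec["age"] is not None and 0 <= rec["age"] <= 120

def duplication(recs):   # ids should be unique across the set
    ids = [r["id"] for r in recs]
    return len(ids) == len(set(ids))

for rec in records:
    print(rec["id"], completeness(rec), conformity(rec), accuracy(rec))
print("no duplicates:", duplication(records))
```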
In the end, things cannot continue as they are now.
In the future, limitations in data quality will be increasingly intolerable, and data's suitability will be evaluated based on its quality. Data providers will necessarily perform the validation tasks, because consumers simply will not be able to do it fast enough. Quality will be the hallmark by which we judge data sources, not the afterthought it is now.
TRUTH VALUE
Anyone who has dealt with large enough datasets, especially from different sources, knows that conflicts in information are inevitable. Most data models are flat and treat all data as ground truth. Everything is modeled as being known equally well, even though that is not accurate. For example, people agree that newer information is more valuable than old information (probably based on the open world assumption), but they don't encode the data that way. Eventually, we will be forced to keep track of how well we know something in order to make probability-based decisions. Veracity, otherwise known as truth value, will be based on validation.
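As a sketch of what tracking "how well we know something" could look like, the snippet below resolves conflicting assertions by weighting each one with an exponential decay on its age and a per-source confidence, then picks the value with the highest combined weight. The half-life, confidences, and sample data are all hypothetical assumptions made for illustration.

```python
import math
from collections import defaultdict

# Conflicting assertions from different sources:
# (value, age_in_days, source_confidence)
assertions = [
    ("address A", 400, 0.9),  # older, but from a trusted source
    ("address B",  10, 0.6),  # recent, but from a weaker source
    ("address B",  30, 0.5),
]

def resolve(assertions, half_life_days=180.0):
    """Weight each assertion by recency and source confidence,
    then return the best value and a rough belief score."""
    weights = defaultdict(float)
    for value, age, confidence in assertions:
        recency = math.exp(-math.log(2) * age / half_life_days)
        weights[value] += confidence * recency
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

print(resolve(assertions))  # ('address B', ~0.84)
```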