Using ‘Big data’ effectively!
The clamour around ‘Big data’ has reached astronomical proportions. Nearly everyone is talking about it, and a recent count of my spam mails on the topic confirmed this. Most companies want to use it strategically; however, they are grappling with its tactical application. While ‘Big data’ can be a good analytical tool, companies must guard against treating it as the universal panacea it is often promised to be.
Here are some of the areas to be cautious about:
1. Base Rate Neglect: ‘Big data’ offers more data for analysis, which can seem to be an advantage but may not work out in reality because of Base Rate Neglect. Consider this example:
Chris is a thin Briton who wears spectacles and loves to listen to Beethoven.
Which is more likely, that Chris is
a. A truck driver or
b. A professor of history at Oxford?
(Note that initially we knew only that Chris was a thin Briton; with ‘Big data’ we got the additional information that he wears spectacles and loves to listen to Beethoven.)
Many would answer b, given the multiple correlated cues in the additional information offered via ‘Big data’.
The detailed description tempts us to disregard the statistical reality that there are far more truck drivers in Britain than there are history professors at Oxford, so a truck driver remains the more likely answer.
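The Chris example can be made concrete with a small Bayes-style calculation. This is only a sketch: the head counts and the probabilities that the description fits each group are hypothetical numbers chosen for illustration, not real statistics, and the function name is my own.

```python
def posterior_professor(n_drivers, n_professors,
                        p_desc_given_driver, p_desc_given_professor):
    """Probability that someone matching the description is a professor.

    Multiplies each group's size (the base rate) by the chance that a
    member of that group fits the description, then normalises.
    """
    drivers_matching = n_drivers * p_desc_given_driver
    professors_matching = n_professors * p_desc_given_professor
    return professors_matching / (drivers_matching + professors_matching)


# Hypothetical inputs: 300,000 truck drivers, 100 Oxford history professors.
# Assume the description fits 1 in 1,000 drivers but 1 in 2 professors.
p = posterior_professor(300_000, 100, 0.001, 0.5)
print(f"P(professor | description) = {p:.2f}")
```

Even when the description is assumed to be 500 times more typical of a professor, the far larger number of truck drivers keeps "truck driver" the better bet under these assumed figures.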
2. Causation and Correlation
This statistical nightmare affects ‘Big data’ analytics even more. A few years back, many applauded Google for sniffing out flu outbreaks based on search patterns in a particular area or geography. Within a few years, ‘Nature News’ had a sad message to convey: the latest flu outbreak had claimed an unexpected victim, Google Flu Trends. The exact reason why Google’s estimate overshot by about a factor of two is not known; perhaps rumours of highly potent superbugs scared healthy people into searching for flu symptoms. Either way, search volume was only correlated with flu cases; people also search for reasons other than being ill.
3. Entire data illusion
Real-time feeds from Facebook, Twitter and others, covering all of their members’ interactions, can give a feeling of comprehensive coverage. But be cautious: while the voluminous data from social media platforms reduces sampling error, it may increase sampling bias. Here is why: Facebook and Twitter members largely belong to a similar cohort, young and living in urban and suburban areas.
When analysing this data to gain consumer insight, we end up with a huge sample of the above cohort and little or no data on consumers outside it. This skews the overall data available for analysis, increasing the sampling bias drastically and making the insights irrelevant for the wider population.
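The trade-off between sampling error and sampling bias can be sketched with a small simulation. All figures here are assumptions made up for illustration: the young urban cohort is taken to be 30% of consumers, with 80% of them liking a product against 30% of everyone else.

```python
import random

# Hypothetical population parameters (illustrative only).
YOUNG_URBAN_SHARE = 0.30      # share of consumers in the social media cohort
P_LIKE_YOUNG_URBAN = 0.80     # chance a cohort member likes the product
P_LIKE_EVERYONE_ELSE = 0.30   # chance anyone else likes the product

# True share of all consumers who like the product.
TRUE_RATE = (YOUNG_URBAN_SHARE * P_LIKE_YOUNG_URBAN
             + (1 - YOUNG_URBAN_SHARE) * P_LIKE_EVERYONE_ELSE)


def social_media_estimate(n, rng):
    """Estimate from n consumers drawn only from the young urban cohort."""
    likes = sum(rng.random() < P_LIKE_YOUNG_URBAN for _ in range(n))
    return likes / n


def random_sample_estimate(n, rng):
    """Estimate from n consumers drawn at random from the whole population."""
    likes = 0
    for _ in range(n):
        in_cohort = rng.random() < YOUNG_URBAN_SHARE
        p_like = P_LIKE_YOUNG_URBAN if in_cohort else P_LIKE_EVERYONE_ELSE
        likes += rng.random() < p_like
    return likes / n


rng = random.Random(42)  # fixed seed so the run is repeatable
big_biased = social_media_estimate(200_000, rng)
small_random = random_sample_estimate(500, rng)

print(f"true rate:                 {TRUE_RATE:.3f}")
print(f"200,000 social media users: {big_biased:.3f}")
print(f"500 random consumers:       {small_random:.3f}")
```

Under these assumptions the huge social media sample gives a very precise estimate of the wrong quantity, while the much smaller random sample is noisier but lands near the true rate.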
Like Michelangelo’s answer to the Pope on the secret of creating the statue of David, “I removed everything that is not David!”, the starting point for using ‘Big data’ can be deciding where ‘Big data’ cannot be used.
Nilesh, today data science is predominantly applied to unstructured data. Soon there will be a huge change in data collection, which will entirely change the analytics-as-a-service industry.
Babaji- shouldn't the heading be "how not to use big data" or "where to avoid pitfalls while using big data"?