Process Intelligence - Can we have too much data?
Many of us have used statistics for years to predict the performance of any number of business metrics. Six Sigma, coupled with various statistical software packages, opened statistical analysis to the masses.
T-tests, ANOVA, and nonparametric techniques accommodated many data types and let us build regression models to predict future performance. It was well understood that these predictive models were vulnerable to biased subsets of data, limited sample sizes, missing factors, and mixed distributions, all of which exposed the client to risk.
Enter the age of Big Data
The turn of the century opened the floodgates of data. Automation and the growing use of SAP, ERP systems, and other data collection on websites gave rise to the age of machine learning, where software can analyze gigabytes of information and automate the search for key factors that would take scores of Black Belts many weeks to duplicate.
The essence of what the data scientist does has not changed. The skill sets involving statistics and programming still serve the same goal - to develop a model that enables a business to predict future performance and to uncover hidden variables that sharpen that prediction.
But Big Data can be a siren’s song. Large data sets can still lead to the same traps that have limited success in the past. Being aware of these pitfalls helps us give a client a clearer picture of the models that emerge. Despite these risks, many large businesses are successfully using predictive models to improve margins and streamline business processes. Let’s take a look at the pitfalls.
Over-Modeling
Large data sets have lifted the restrictions of small sample sizes, but an unintended consequence arises: the data scientist can now detect very small effects from many variables. The p-value has long been the gold standard for deciding whether a factor is significant to a process. However, we also have to look at the factor itself to see whether its impact is 50%, 5%, or 0.5%. It’s the ‘distinction without a difference’ conundrum. There is no reason to add a factor with a tiny effect size to a model; doing so increases the risk of a team chasing improvements, or adding investment, to a factor that does not practically move the needle.
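A minimal sketch of this point, assuming a simulated scenario rather than anything from the article: with a million records per group, a difference of one percent of a standard deviation earns a very small p-value even though the effect size says it is not worth chasing.

# Illustrative simulation only: huge n makes a negligible effect "significant" by p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000  # a Big Data-sized sample per group

# Two groups whose true means differ by only 1% of a standard deviation.
group_a = rng.normal(loc=100.0, scale=10.0, size=n)
group_b = rng.normal(loc=100.1, scale=10.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
cohens_d = (group_b.mean() - group_a.mean()) / group_a.std()

print(f"p-value: {p_value:.1e}")            # tiny -> "statistically significant"
print(f"effect size (d): {cohens_d:.3f}")   # roughly 0.01 -> practically negligible

Reporting the effect size alongside the p-value is what keeps the team from investing in a factor that does not move the needle.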
Under-Modeling
This trap is related to over-modeling. Many data sets contain fields that naturally lead to factors for analysis. When the data scientist has rolled up the complete model, he or she must know the R-squared, or predictive power, of the model. If the factors on hand explain only half of the outcome, the data scientist must either track down the missing factors or, more importantly, recognize and communicate the limitations of the predictions. There is great pressure to supply answers to the C-suite, but sharing a model’s risks is just as important as sharing what it shows. The data is the data.
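As a rough sketch of how that check might look, assuming hypothetical factor names (captured_factor, missing_factor) and a simulated process rather than any real client data: a model built only on the fields that happen to be in the data set can report an R-squared near 0.5 simply because a second driver was never logged.

# Illustrative sketch: the outcome depends on two drivers, but only one is in the data set.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 10_000

captured_factor = rng.normal(size=n)
missing_factor = rng.normal(size=n)   # e.g. an unlogged supplier or shift variable
y = 2.0 * captured_factor + 2.0 * missing_factor + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(captured_factor.reshape(-1, 1), y)
r_squared = model.score(captured_factor.reshape(-1, 1), y)

print(f"R-squared with only the captured factor: {r_squared:.2f}")  # around 0.5

Roughly half the variation is unexplained - a limitation worth stating alongside any prediction handed to the C-suite.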
Summary
Big Data has given the data scientist the ability to better inform business leaders about the health of their business and “way forward” strategies. By staying mindful of the statistical fundamentals that affect large samples as much as small ones, we can trust that the fidelity of our predictive models will lead us to the next breakthrough capability…Artificial Intelligence.
Nicely done. I recall conversations we have had about non-parametric methods used to reduce the number of factors to consider before testing for significance. Recall the BOB/WOW contrast, looking for factors that cause outcomes in the tails rather than the body of a distribution.
The issue with Big Data is that it is mistaken for Designed Experiments, or Randomized Orthogonal Designs to be more specific. Thus, collinearity will make all the "insights" just hunches at best and misleading at worst. The second issue with all observational data is that the range of variation is usually small if a well-controlled process is studied. The other issue is that if the absolute effect is small, the gains are negligible regardless of the p-value. I would much rather chase a practically significant effect than a statistically significant negligible one. Big Data is often Tiny Insights. I am not saying ignore it, just put some thought into the next steps of the investigation. What scientific principles govern the process effects? Are these process variables in the science model, not just the statistical model? Are we looking at the full possible process range? If the stats make little scientific sense, something is wrong.
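A minimal sketch of the collinearity point above, an assumed illustration rather than anything from the original exchange: when two factors drift together in observational data, their individual coefficients become unstable, while an orthogonal design separates them cleanly.

# Illustrative comparison: orthogonal design vs. collinear observational data.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Orthogonal design: the two factors are varied independently (e.g. a 2-level DOE).
x1_doe = rng.choice([-1.0, 1.0], size=n)
x2_doe = rng.choice([-1.0, 1.0], size=n)

# Observational data: the same factors drift together (highly collinear).
x1_obs = rng.normal(size=n)
x2_obs = x1_obs + rng.normal(scale=0.05, size=n)

def fitted_coefficients(x1, x2):
    """Least-squares coefficients for y = b1*x1 + b2*x2 + noise (true b1 = b2 = 1)."""
    y = x1 + x2 + rng.normal(scale=0.5, size=len(x1))
    X = np.column_stack([x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

print("DOE coefficients:          ", fitted_coefficients(x1_doe, x2_doe))
print("Observational coefficients:", fitted_coefficients(x1_obs, x2_obs))
# With collinear data the individual coefficients swing widely from sample to sample,
# even though the combined fit looks fine - hunches rather than insights.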
Nice. Fits like a tailor-made suit.