Simpson's Paradox and Data
Data for statistical analysis can be tricky, especially when it is interpreted the wrong way. When it comes to aggregation, the best-known illustration is Simpson's paradox, named after Edward Simpson, who published a paper on contingency tables in the Journal of the Royal Statistical Society in 1951.
So what is this paradox, and why does it matter to data analysts? As we move towards big data, caution must be exercised when aggregating small data sets into bigger ones, because aggregation makes accurate interpretation harder. Simpson's paradox refers to the reversal of an association between two variables when the data is viewed in aggregate rather than broken down by a third variable. The two views can lead to exactly opposite conclusions!
This paradoxical result can badly skew insights. The University of California, Berkeley experienced it first hand in the 1970s, when a discrimination suit was filed against it for accepting more male students than female. The aggregated admissions figures told the opposite story from the figures for individual subjects. In simple terms, take, say, three subjects. Within each subject, the acceptance rate (admissions as a share of applications) for men is lower than that for women. Yet when all three subjects are combined, the overall acceptance rate for men is higher than that for women. The reversal arises because the two groups apply to the subjects in different proportions: if more women apply to the subjects with low acceptance rates, their overall rate is pulled down. It is all about aggregation!
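The reversal is easier to see with numbers. Here is a minimal Python sketch using three hypothetical subjects; the counts are invented purely for illustration and are not the actual Berkeley figures, and the names `ADMISSIONS`, `rates_by_subject` and `aggregate_rates` are made up for this example.

```python
# Hypothetical admissions counts, invented for illustration only
# (NOT the real Berkeley data).
# Layout: subject -> group -> (applications, admissions)
ADMISSIONS = {
    "A": {"men": (800, 480), "women": (100, 70)},
    "B": {"men": (150, 30), "women": (500, 150)},
    "C": {"men": (50, 5), "women": (400, 60)},
}


def rates_by_subject(data):
    """Acceptance rate for each group within each subject."""
    return {
        subject: {
            group: admitted / applied
            for group, (applied, admitted) in groups.items()
        }
        for subject, groups in data.items()
    }


def aggregate_rates(data):
    """Acceptance rate for each group with all subjects pooled together."""
    totals = {}
    for groups in data.values():
        for group, (applied, admitted) in groups.items():
            total_applied, total_admitted = totals.get(group, (0, 0))
            totals[group] = (total_applied + applied, total_admitted + admitted)
    return {
        group: admitted / applied
        for group, (applied, admitted) in totals.items()
    }


# Disaggregated view: women have the higher rate in every subject.
for subject, rates in rates_by_subject(ADMISSIONS).items():
    print(f"Subject {subject}: men {rates['men']:.1%}, women {rates['women']:.1%}")

# Aggregated view: men have the higher rate overall.
overall = aggregate_rates(ADMISSIONS)
print(f"Overall:   men {overall['men']:.1%}, women {overall['women']:.1%}")
```

In this made-up data most men apply to subject A, which admits a large share of its applicants, while most women apply to subjects B and C, which admit far fewer. That difference in where people apply is enough to flip the overall comparison.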
Hence statistics can be deceptive, and sometimes carry a much deeper meaning than is apparent! This can also affect decisions in industries where general effects or practices involving a number of variables are studied and averaged.
Karl Pearson said of statistics, “Statistics is the grammar of science”. Data is most often telling a story, and you may need to read between the lines to understand its underlying implications, especially when making decisions!
As the saying goes, "bigger is better", but complexity increases proportionately. Understanding your data is the key, and learning to read between the lines comes with experience and a 360-degree view. "Data is the next Intel Inside", says Tim O'Reilly. Will large datasets offer a higher form of intelligence and knowledge, generating insights that were previously not available? Yes, they can, and they will.