Data Fallacy – Why it is sometimes good to visualize data rather than just look at statistics

In this short article, I try to bring out why understanding the context is very important when dealing with data. The popular saying that “Numbers don’t lie” is true. However, before you conclude anything from an analysis of numbers, it is very important to understand the distribution or spread of the data and the relationships between the data points. It is tempting to jump directly to statistics and draw conclusions about relationships and underlying data behaviour, but it is always advisable to visualize the data as well before concluding anything about its inherent properties.

To illustrate this, I bring in Anscombe’s Quartet. Here we have four sets of Xs and Ys. If you compute descriptive statistics for these sets, all Xs have the same mean, standard deviation and variance, and all Ys have almost exactly the same mean, standard deviation and variance. Even the R squared values, which in a way measure the strength of the relationship between X and Y, are nearly the same. Does that mean the four sets are similar or comparable?

The mean of all Xs is 9, with a standard deviation of 3.16 and a variance of 10. Similarly, all Ys have a mean of about 7.5, a standard deviation of about 1.94 and a variance of about 3.75 (these are population figures, dividing by n).
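These summary figures can be checked with a few lines of Python. What follows is a minimal sketch, not the calculation used for this article: it assumes the commonly published values of Anscombe's Quartet (typed in here because the article's own table is an image) and uses population formulas (dividing by n). The names quartet, describe and summary are illustrative.

```python
from statistics import mean, pstdev, pvariance

# Commonly published Anscombe values (assumed to match the article's table image).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def describe(values):
    """Return (mean, population variance, population standard deviation)."""
    return mean(values), pvariance(values), pstdev(values)


# One (X stats, Y stats) pair per dataset.
summary = {name: (describe(xs), describe(ys)) for name, (xs, ys) in quartet.items()}

for name, (x_stats, y_stats) in summary.items():
    print(f"Set {name}: "
          f"X mean={x_stats[0]:.2f} var={x_stats[1]:.2f} sd={x_stats[2]:.2f} | "
          f"Y mean={y_stats[0]:.2f} var={y_stats[1]:.2f} sd={y_stats[2]:.2f}")
```

Swapping pvariance and pstdev for variance and stdev gives the sample versions (dividing by n - 1), which are slightly larger but still match across the four sets.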

If you plot Y vs X for each set, you will clearly see that the four datasets look completely different. The descriptive statistics may have led you to think that the relationships between X and Y are similar across the sets, or that all the Xs and all the Ys are alike. The scatter charts prove that wrong, even though the beta coefficients of the regression and the R squared values (about 0.67) are also nearly identical for all four datasets.
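The same contrast can be reproduced in code. The sketch below is an illustration under assumptions, not the charts from this article: it uses the commonly published Anscombe values, fits an ordinary least squares line to each set by hand, and draws a crude character-grid scatter plot so that it runs without any plotting library. The helper names fit_line and text_scatter, and the axis ranges, are arbitrary choices made for this example.

```python
from statistics import mean

# Commonly published Anscombe values (assumed to match the article's table image).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R squared."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return slope, intercept, r_squared


def text_scatter(xs, ys, x_range=(3, 20), y_range=(2, 14), width=36, height=12):
    """Draw a crude character-grid scatter plot (axis ranges are arbitrary)."""
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_range[0]) / (x_range[1] - x_range[0]) * (width - 1))
        row = round((y - y_range[0]) / (y_range[1] - y_range[0]) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the chart
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}

for name, (xs, ys) in quartet.items():
    slope, intercept, r_squared = fits[name]
    print(f"Set {name}: y = {intercept:.2f} + {slope:.2f}x, R squared = {r_squared:.2f}")
    print(text_scatter(xs, ys))
    print()
```

With these values, every set gives a slope of about 0.50, an intercept of about 3.00 and an R squared of about 0.67, yet the printed grids show four very different shapes: a roughly linear cloud, a curve, a straight line with one outlier, and a vertical stack of points with a single far-off point.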

[Images: scatter charts of Y vs X for each of the four datasets]

This set of four datasets is called Anscombe's Quartet. It is a stark reminder that, technically, data doesn’t lie, but visualization is also very important; otherwise one may fall into the trap of wrong conclusions and recommendations. The data cannot be blamed here, but the way it is interpreted can be. This is a typical case where visualization comes to our rescue in interpreting data, just as in many other cases a statistic needs to be read alongside other evidence rather than in isolation.



