Visualizing Uncertainty
I was going through a research paper [2] that discusses how visualizations often treat data as a full population when it is really only a sample of one. While such visualizations are simple to interpret, they ignore the uncertainty of assuming a slice to be representative of the pie. I decided to implement a method described in the paper, and in this article I outline what I learned to make someone else's first step into the world of statistics a little easier. Let's dive in!
Xanardia
Say a data analyst, James, reports to the Energy Minister of Xanardia, a small mountainous country. The minister wishes to take advantage of the drop in fuel prices due to the pandemic and stockpile fuel for the next year. James, a simpleton, decides that looking at the averages for the previous 5 years shall suffice to determine the next year's stock. Looking at the figure below, it is trivial to infer that in 2017 and 2019 the fuel consumption exceeded 40000 metric tonnes. This simple and absolute representation of data is what makes bar graphs so popular.
While comparing bar heights is easy, it would be imprudent for James to blindly rely on averages computed from a portion of Xanardia's population (each such average is called a point estimate). What he needs is a metric that gives some idea of the actual yearly averages (μ) for the entire population. Here 'uncertain visualizations' come to our rescue.
Before we visualize uncertainty, we shall consider how uncertainty is quantified. Assume James wishes to ponder a bit deeper and collects the 2019 fuel statistics from 10 samples of Xanardia's populace. He calculates the average of each sample, giving 10 averages, and then determines their Standard Deviation (SD). This is what statisticians call the Standard Error of the Mean (SEM) or just the Standard Error (SE).
SEM tells us about the precision of the mean as opposed to SD of the distribution, which expresses the spread of data.
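To make the distinction concrete, here is a minimal sketch using only Python's standard library. The population is synthetic (a normal distribution with made-up parameters), not Xanardia's actual data, and all names are hypothetical. It draws 10 samples, averages each one, and takes the SD of those 10 averages.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters, chosen only for illustration
POP_MEAN, POP_SD = 45000, 8000
N_SAMPLES, SAMPLE_SIZE = 10, 200

# Draw 10 independent samples and record the average of each
sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

# SD of the sample means: an empirical standard error of the mean
empirical_sem = statistics.stdev(sample_means)

# SD of the last sample drawn: the spread of the data itself
spread_of_data = statistics.stdev(sample)

print(f"SD of one sample (spread of data): {spread_of_data:.0f}")
print(f"SD of the 10 averages (SEM):       {empirical_sem:.0f}")
```

The second number should come out far smaller than the first: averages of 200 values vary much less than the individual values do.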
However, collecting data multiple times from different samples of a population is more often than not impossible. Imagine conducting the Census 5 times for 300 million people in the US! Fortunately, statisticians have come up with an 'acceptably accurate' estimate of the SEM: simply divide the SD of the data sample (s) by the square root of the sample size (n), that is, the number of observations in the sample. In short, SEM ≈ s / √n. Do keep in mind that the actual SD of the distribution (σ) has been approximated by 's' from a single sample.
In our case, rather than conduct the experiment repeatedly, James would just have to determine the standard deviation of the 2019 fuel consumption then divide by the square root of the number of people participating in the survey.
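As a rough sketch of that calculation, the helper below (a hypothetical name, not from the article's code) estimates the SE from one sample as s / √n. The survey values are invented purely for illustration.

```python
import math
import statistics

def standard_error(sample):
    """Estimate the SEM from a single sample as s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample SD (n - 1 in the denominator)
    return s / math.sqrt(n)

# Hypothetical survey responses (fuel use per respondent), for illustration only
survey = [52, 47, 39, 61, 44, 50, 58, 41, 49, 55]
print(f"n = {len(survey)}, s = {statistics.stdev(survey):.2f}, "
      f"SE = {standard_error(survey):.2f}")
```

Because n sits under a square root, quadrupling the number of respondents only halves the standard error.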
Alright, armed with our newfound knowledge, let's look at the problem again. While we cannot determine the precise value of the mean fuel consumed in 2019, we can estimate a range in which the mean is likely to lie with a fair amount of certainty. This range is called the confidence interval for the mean.
The Central Limit Theorem helps us here: for experiments with a large enough sample size, regardless of the underlying distribution, the sample means approximately follow the normal distribution [4].
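A quick simulation can make this claim tangible. The sketch below is an illustration under assumed settings (exponential data, 100 observations per sample, 2000 repeated samples), none of which come from the article. It checks that the sample means of a heavily skewed distribution behave like a normal distribution, for which about 95% of values fall within 1.96 SD of the centre.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 100     # observations per sample (assumed)
N_EXPERIMENTS = 2000  # number of repeated samples (assumed)

# Exponential data with rate 1: heavily skewed, true mean 1 and true SD 1
means = [
    statistics.mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_EXPERIMENTS)
]

se = 1.0 / math.sqrt(SAMPLE_SIZE)  # theoretical standard error
within = sum(abs(m - 1.0) <= 1.96 * se for m in means) / N_EXPERIMENTS

print(f"mean of sample means: {statistics.mean(means):.3f}")
print(f"SD of sample means:   {statistics.stdev(means):.3f} (theory: {se:.3f})")
print(f"share within 1.96 SE: {within:.1%}")
```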
Since we are considering a country-sized sample, it makes sense to presume that the means for 2019 follow the normal distribution. The normal distribution has a well-known property: around 95% of the data lies within 2 standard deviations of the mean. Hence James can say with approximately 95% confidence that the actual mean fuel consumed in 2019 in Xanardia lies within 2 (1.96 to be exact) standard errors (SE) of the observed mean of 47743 MT.
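The interval itself is a one-line calculation. In this sketch, the observed mean of 47743 MT is the article's figure, but the standard error is a made-up placeholder because the article does not report it. The function name is also hypothetical.

```python
from statistics import NormalDist

def confidence_interval(mean, se, confidence=0.95):
    """Return (low, high) for the mean under a normal approximation."""
    # Two-sided critical value: about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se

observed_mean = 47743  # 2019 mean from the article, in MT
assumed_se = 1500      # hypothetical SE; the article does not report it

low, high = confidence_interval(observed_mean, assumed_se)
print(f"95% CI: {low:.0f} MT to {high:.0f} MT")
```

Asking for more confidence (say 99%) raises the critical value and widens the interval. A smaller SE narrows it.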
Rinse and repeat for the means from 2016 to 2018. The 95% confidence interval for the mean is represented by the 'I' on each bar. The interval for 2016 seems oddly large. This can be because of either a small sample size or a high standard deviation within the sample.
So this visualization gives us a much better picture of the uncertainties in the data. There is a problem though: how shall James use this uncertainty? Earlier, we asked the question, 'Which year(s) have fuel consumption higher than 42000 MT?' That has now changed to 'How likely is each year's consumption to be greater than 42000 MT?' What was earlier definite has become probabilistic. Forget government officials like James, even statisticians struggle to make decisions when confronted with such 'uncertain' scenarios [2].
This is what my visualization, inspired by the paper, attempts to address.
Here, James adjusts the threshold value and the colors of the bar indicate the probability that the consumption for the respective year exceeded the threshold value. In most visualizations with uncertainties, a glut of parameters to consider makes them a poor choice for decision making and ends up puzzling users instead of adding value.
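One plausible way to compute such a probability, sketched here under the normal approximation used earlier, is to treat each year's true mean as normally distributed around the observed mean with SD equal to the SE. This is an assumption for illustration, not necessarily how the code on GitHub does it. The function names and the (mean, SE) pairs are hypothetical; only the 2019 mean comes from the article.

```python
from statistics import NormalDist

def prob_mean_exceeds(threshold, mean, se):
    """P(true mean > threshold), treating the mean as Normal(mean, se)."""
    return 1 - NormalDist(mu=mean, sigma=se).cdf(threshold)

def prob_to_hex(p):
    """Linearly blend from blue (p = 0) through white (p = 0.5) to red (p = 1)."""
    if p < 0.5:
        t = p / 0.5
        r, g, b = t, t, 1.0
    else:
        t = (p - 0.5) / 0.5
        r, g, b = 1.0, 1 - t, 1 - t
    return "#{:02x}{:02x}{:02x}".format(*(round(255 * c) for c in (r, g, b)))

# Hypothetical (mean, SE) pairs; only the 2019 mean comes from the article
years = {2016: (33000, 6000), 2017: (42000, 1200),
         2018: (39500, 1000), 2019: (47743, 1500)}
threshold = 42000

for year, (mean, se) in years.items():
    p = prob_mean_exceeds(threshold, mean, se)
    print(year, f"P(mean > {threshold}) = {p:.2f}", prob_to_hex(p))
```

The resulting hex strings could be passed as bar colors to a plotting library. A diverging scale centred on 0.5 keeps 'toss-up' years visually neutral.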
Simple visualizations with interactivity allow users to make better decisions in uncertain scenarios. While there are other methods of visualizing data (like scatter plots and line graphs) and many parameters other than averages worthy of consideration for 'uncertain visualizations', this is a step in the right direction.
P.S. I struggled a bit with assigning distinct colors to distinct intervals in matplotlib. Check out my code on GitHub. Hopefully it can help you where Stack Overflow can't :)
Thanks a ton for reading my (first) article! Any suggestions are most appreciated.
References used:
[1] 'Yellow road signage at daytime', photo by Robert Ruggiero on Unsplash.
[2] Ferreira, N., Fisher, D., & Konig, A. C. (2014, April). Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 571-580). ACM.
[3] 'City at Night', photo by Andrew Schultz on Unsplash.
[4] University of Newcastle, Standard Errors and Confidence Intervals.