Boxplots Using Matplotlib
Boxplots visualize how data is distributed and make it easy to compare distributions across categorical groups. They also show which values in each group are outliers.
This article walks through the five summary values needed to read a boxplot (minimum, first quartile, median, third quartile, and maximum) and then covers how to draw boxplots with Matplotlib.
Let’s do this math on a toy dataset:
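Before settling on the minimum and maximum values, it’s important to check for outliers, because outliers are excluded from the whiskers and are drawn as separate points instead.
To find the outliers, let’s first compute the interquartile range (IQR), which is the third quartile (Q3) minus the first quartile (Q1). By the usual convention, any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier.
Since every value in the dataset falls within the above range, it is safe to say that there are no outliers in this data.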
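The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the original worked example: the toy values and the helper name five_number_summary are made up here, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np


def five_number_summary(values, whis=1.5):
    """Return the values a boxplot is built from, plus any outliers.

    Hypothetical helper for illustration; `whis` is the IQR multiplier
    used to place the outlier fences (1.5 by convention).
    """
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Anything outside these fences counts as an outlier.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    is_inside = (data >= lower_fence) & (data <= upper_fence)

    return {
        # Whisker ends: the most extreme values that are not outliers.
        "min": float(data[is_inside].min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(data[is_inside].max()),
        "iqr": float(iqr),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": data[~is_inside].tolist(),
    }


# Made-up toy dataset, not the one from the original article.
toy = [2, 4, 5, 7, 8, 10, 12, 13, 15]
summary = five_number_summary(toy)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this made-up dataset the outlier list comes back empty, so the whiskers simply end at the smallest and largest values.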
Let’s plot this on a number line:
Examples of scenarios where we can use boxplots:
This week I worked on a case study on university student data where I needed to identify the students who are performing poorly in their academics and are therefore at risk of dropping out of college. After cleaning the raw data into a moderately clean dataset, I used Matplotlib to visualize a few descriptive analyses.
To get an overview of the performance across the majors, it’s a good idea to look at the distribution of students’ GPAs and compare it across majors.
Luckily, we don’t have to do all the math we did above because ‘matplotlib.pyplot.boxplot’ does everything for us.
Two lines of code are all we need to plot the boxplot; however, we are going to make it presentable by adding a few parameters such as labels, colors, and titles, as shown in the snippet below.
Note: Matplotlib's pyplot module has been imported as 'plt' (import matplotlib.pyplot as plt).
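As a rough sketch of what such a plot could look like, the example below draws one box per major with labels, colors, and a title. The GPA values, the colors, and the helper name plot_gpa_boxplots are invented for illustration and are not the case-study data. The tick labels are set on the axes afterwards so the example does not depend on the label keyword of boxplot, whose name differs between Matplotlib versions.

```python
import matplotlib.pyplot as plt

# Made-up GPAs per major, for illustration only.
gpas_by_major = {
    "major_1": [1.8, 2.0, 2.1, 2.3, 2.4, 2.6, 2.9, 3.2],
    "major_2": [2.4, 2.7, 2.9, 3.0, 3.1, 3.3, 3.5, 3.8],
    "major_3": [3.1, 3.4, 3.6, 3.7, 3.8, 3.9, 4.0],
}


def plot_gpa_boxplots(groups):
    """Draw one box per group (hypothetical helper, not from the article)."""
    fig, ax = plt.subplots(figsize=(7, 4))

    # patch_artist=True makes the boxes fillable with a face color.
    result = ax.boxplot(list(groups.values()), patch_artist=True)

    # Assumed colors, one per box.
    colors = ["lightcoral", "khaki", "lightgreen"]
    for box, color in zip(result["boxes"], colors):
        box.set_facecolor(color)

    # Boxes are drawn at x = 1, 2, 3, so one label per box is enough.
    ax.set_xticklabels(list(groups.keys()))
    ax.set_xlabel("Major")
    ax.set_ylabel("GPA")
    ax.set_title("GPA distribution by major")
    return fig, ax, result


fig, ax, result = plot_gpa_boxplots(gpas_by_major)
```

Calling plt.show() afterwards displays the figure, and fig.savefig("gpa_by_major.png") would write it to a file instead.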
The above boxplots show that many students from ‘major_1’ have lower GPAs and are likely to drop out of college due to compliance issues. In contrast, students from ‘major_3’ are performing well, with a minimum GPA of 3.1 and a median GPA of 3.7.
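Figures such as the minimum and median can also be read off numerically instead of from the picture. The sketch below uses matplotlib.cbook.boxplot_stats, which computes the statistics behind a boxplot; the GPA list is made up and merely chosen to echo the numbers quoted above, so it is not the real ‘major_3’ data.

```python
from matplotlib import cbook

# Made-up GPAs, chosen only to mirror the values quoted in the text.
gpas_major_3 = [3.1, 3.4, 3.6, 3.7, 3.8, 3.9, 4.0]

# boxplot_stats returns one dict per group; there is a single group here.
stats = cbook.boxplot_stats(gpas_major_3, labels=["major_3"])[0]

print("lower whisker (minimum):", stats["whislo"])
print("median:", stats["med"])
print("upper whisker (maximum):", stats["whishi"])
print("outliers:", list(stats["fliers"]))
```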
Now we know that, according to the problem statement, the target population is the students from ‘major_1’, since they are the ones with the lower GPAs.