Boxplots Using Matplotlib

Boxplots visualize the distribution of a dataset and make it easy to compare distributions across categorical groups. They also show which values in each group are outliers.

This article first covers the five summary values needed to interpret a boxplot, and then shows how to draw boxplots with Matplotlib.

  1. Min: the smallest value in the ordered dataset.
  2. Q1: Quartile 1, or the 25th percentile of the ordered dataset.
  3. Q2: the median of the ordered dataset.
  4. Q3: Quartile 3, or the 75th percentile of the ordered dataset.
  5. Max: the largest value in the ordered dataset.

Let’s do this math on a toy dataset:

[Image: the five values computed on the toy dataset]
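The original figure is not available as text, so here is a minimal sketch of the same calculation. The dataset is hypothetical (it is not the one from the figure), and the quartiles use NumPy's default linear interpolation; other quartile conventions can give slightly different values.

```python
import numpy as np

# Hypothetical toy dataset, chosen only for illustration
data = np.array([7, 1, 13, 4, 10, 3, 12, 6, 8])

# The five values are defined on the ordered data
ordered = np.sort(data)

# 25th, 50th and 75th percentiles (NumPy's default linear interpolation)
q1, q2, q3 = np.percentile(ordered, [25, 50, 75])

summary = {
    "min": float(ordered[0]),
    "q1": float(q1),
    "median": float(q2),
    "q3": float(q3),
    "max": float(ordered[-1]),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```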

Before settling on the minimum and maximum values, it’s important to check for outliers, because outliers are not part of the box and whiskers; they are drawn as separate points.

To find the outliers, let’s do some quick math to find the Inter-Quartile Range (IQR).

[Image: the IQR calculation and the resulting range used to check for outliers]

Since every value in the dataset falls within the above range, it is safe to say that there are no outliers in this data.
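As a sketch of the outlier check, the snippet below assumes the conventional rule that anything below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is an outlier (1.5 is also the default `whis` value in Matplotlib's boxplot). The dataset is again hypothetical, not the one from the figures.

```python
import numpy as np

# Hypothetical toy dataset, chosen only for illustration
data = np.array([1, 3, 4, 6, 7, 8, 10, 12, 13])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # Inter-Quartile Range

# Assumed 1.5 * IQR rule for the acceptable range
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the range are treated as outliers
outliers = data[(data < lower_limit) | (data > upper_limit)]

# The whiskers end at the most extreme values that are not outliers
inside = data[(data >= lower_limit) & (data <= upper_limit)]
whisker_min, whisker_max = inside.min(), inside.max()

print("IQR:", iqr)
print("Range:", lower_limit, "to", upper_limit)
print("Outliers:", outliers)
print("Whiskers:", whisker_min, "to", whisker_max)
```

With this data nothing falls outside the range, so the whiskers simply end at the smallest and largest values.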

Let’s plot this on a number line:

[Image: the toy dataset's boxplot drawn on a number line]

Examples of scenarios where we can use boxplots:

  1. To visualize and compare the scores between basketball teams.
  2. To visualize and compare the GPA of the students across various departments or majors.
  3. To visualize and compare the prices of flights to New York during the off-season vs. the holiday season.

This week I worked on a case study on university student data where I needed to identify the students who are performing poorly in their academics and are therefore at risk of dropping out of college. After cleaning the raw data into a moderately clean form, I used Matplotlib to visualize a few descriptive analyses.

To get an overview of performance across the majors, it’s a good idea to look at the distribution of students’ GPAs and compare it between majors.

Luckily, we don’t have to do all the math we did above because ‘matplotlib.pyplot.boxplot’ does everything for us.

[Image: the code that calls boxplot on the GPA data, and the resulting plot]
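The original code is only available as an image, so the following is a minimal sketch of a basic call rather than the author's exact code. The GPA values and the `gpa_by_major` name are hypothetical; the case-study data is not reproduced here.

```python
import matplotlib.pyplot as plt

# Hypothetical GPAs per major (illustrative values, not the case-study data)
gpa_by_major = {
    "major_1": [1.8, 2.2, 2.5, 2.9, 3.3],
    "major_2": [2.6, 2.9, 3.2, 3.4, 3.8],
    "major_3": [3.1, 3.5, 3.7, 3.8, 4.0],
}

fig, ax = plt.subplots()

# One box per list; quartiles, whiskers and outliers are computed by Matplotlib
result = ax.boxplot(list(gpa_by_major.values()))
```

Calling `plt.show()` afterwards displays the figure in an interactive session.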

These two lines of code are all we need to plot the boxplot. However, we are going to make it presentable by adding a few parameters such as labels, colors, and titles, as shown in the snippet below.

Note: Matplotlib's pyplot module has been imported as 'plt' (import matplotlib.pyplot as plt).

[Image: the code snippet with labels, colors, and a title, and the resulting boxplots of GPA by major]
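Since the styled snippet is also only available as an image, here is one possible way to add labels, colors, and a title. It is a sketch under assumptions: the data, colors, and title text are hypothetical, and the tick labels are set with `set_xticklabels` so that the code does not depend on a particular Matplotlib version's label parameter.

```python
import matplotlib.pyplot as plt

# Hypothetical GPAs per major (illustrative values, not the case-study data)
gpa_by_major = {
    "major_1": [1.8, 2.2, 2.5, 2.9, 3.3],
    "major_2": [2.6, 2.9, 3.2, 3.4, 3.8],
    "major_3": [3.1, 3.5, 3.7, 3.8, 4.0],
}
colors = ["tomato", "gold", "lightgreen"]  # arbitrary color choices

fig, ax = plt.subplots(figsize=(8, 5))

# patch_artist=True draws the boxes as filled patches so they can be colored
result = ax.boxplot(list(gpa_by_major.values()), patch_artist=True)

# Fill each box with its own color
for box, color in zip(result["boxes"], colors):
    box.set_facecolor(color)

# Label each box with its major, then add axis labels and a title
ax.set_xticklabels(list(gpa_by_major.keys()))
ax.set_xlabel("Major")
ax.set_ylabel("GPA")
ax.set_title("GPA distribution by major")
```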

The boxplots above show that many students from ‘major_1’ have low GPAs and are likely to drop out of college due to compliance issues. In contrast, students from ‘major_3’ are performing well, with a minimum GPA of 3.1 and a median GPA of 3.7.

Now we know that, according to the problem statement, the target population is the students from ‘major_1’.

More articles by Asha Pondicherry
