Seaborn Tutorial in Python 3.6+

Why Jupyter notebook?

Jupyter Notebook is a free, open-source, interactive web tool that lets users combine code, output, text and other multimedia resources in a single document. For details on how to install Jupyter Notebook, refer to https://jupyter.readthedocs.io/en/latest/install/notebook-classic.html

About Seaborn

Seaborn is an open-source Python library for making statistical graphics, and one of many data visualisation libraries in Python. It is built on top of Matplotlib. I have chosen Seaborn for this tutorial because it is one of the most intuitive libraries, offering a wide variety of visualisation patterns. It also integrates more closely with the pandas DataFrame than most visualisation libraries, and it lets us plot graphics with much simpler code through a straightforward set of built-in functions. Every visualisation library has its pros and cons, but Seaborn is a good library with which to begin your visualisation journey in Python.

Installation

Official releases of seaborn can be installed from PyPI:

! pip install seaborn

The library is also included as part of the Anaconda distribution:

conda install seaborn

These instructions are taken directly from https://seaborn.pydata.org/installing.html. Refer to that page if you want to know more about debugging installation issues.

Dependencies

Python 3.6+

Required Dependencies

The following libraries will be installed when you install Seaborn:

numpy, scipy, pandas, matplotlib

Optional Dependencies

1) statsmodels, for advanced regression plots

2) fastcluster, for clustering large matrices
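To confirm which of these packages are present in the current environment, a small check along the following lines can help. This is only a sketch: the helper name installed_version is made up for illustration, and it assumes each package exposes a __version__ attribute, which is conventional but not guaranteed.

```python
import importlib

REQUIRED = ["numpy", "scipy", "pandas", "matplotlib", "seaborn"]
OPTIONAL = ["statsmodels", "fastcluster"]


def installed_version(name):
    """Return the package's version string, or None if it cannot be imported.

    Hypothetical helper for illustration; packages that import fine but
    define no __version__ attribute are reported as "unknown".
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


for name in REQUIRED + OPTIONAL:
    label = "required" if name in REQUIRED else "optional"
    version = installed_version(name)
    print(f"{name:<12} {label:<9} {version or 'not installed'}")
```

Printing the versions at the top of a notebook also makes it easier for others to reproduce the environment later.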

About the dataset

The World Happiness dataset comes from the Gallup World Poll, which investigates the global state of happiness. The data is useful to governments seeking global recognition, to large companies looking at indicators of quality of life for their employees, and to civil society organisations that use such indicators in policy decision-making. Historically, numerous studies have sought to measure well-being in terms of the economic and social prosperity of nations.

I have taken the dataset from Kaggle and it can be found at https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv.

For the purpose of this tutorial, we will only be using the world-happiness-report.csv dataset, which covers the years 2005 to 2020. However, please feel free to download and merge the newer world-happiness-report-2021.csv dataset as well. I chose this dataset primarily because the visualisation techniques covered here can be applied to it together to draw meaningful insights from the data.

Import Relevant Libraries & Dataset

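The original code cell is not available here, so the following is a minimal sketch of loading the data with pandas. The column names are assumed to match the Kaggle file, and the three rows are made-up placeholders rather than real survey results; with the real file downloaded, pass its path to read_csv instead.

```python
import io

import pandas as pd

# Made-up rows in the assumed layout of world-happiness-report.csv.
# The values are placeholders, NOT real survey results.
sample_csv = io.StringIO(
    "Country name,year,Life Ladder,Log GDP per capita,Social support\n"
    "Country A,2018,5.1,9.2,0.81\n"
    "Country A,2019,5.3,9.3,0.83\n"
    "Country B,2019,6.9,10.6,0.93\n"
)

# With the real dataset: df = pd.read_csv("world-happiness-report.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # number of rows and columns
print(df.dtypes)  # year is parsed as an integer, the indicators as floats
```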

Data Understanding & Cleaning

What the data looks like

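The inspection and cleaning cells are not reproduced here, so this is a sketch of typical steps under stated assumptions: the frame below holds made-up placeholder values in the assumed column layout, and dropping rows with missing values is an assumption about the cleaning, not a confirmed step. Dropping the positive affect and negative affect columns follows what is described later in this tutorial.

```python
import numpy as np
import pandas as pd

# Tiny placeholder frame with made-up values (NOT real survey data).
df = pd.DataFrame({
    "Country name": ["Country A", "Country A", "Country B", "Country C"],
    "year": [2018, 2019, 2019, 2019],
    "Life Ladder": [5.1, 5.3, 6.9, 4.2],
    "Log GDP per capita": [9.2, 9.3, 10.6, np.nan],
    "Positive affect": [0.70, 0.72, 0.80, 0.60],
    "Negative affect": [0.30, 0.28, 0.20, 0.35],
})

print(df.head())        # first few rows
df.info()               # column types and non-null counts
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for numeric columns

# Drop the two affect columns, then (assumption) rows with missing values.
clean = (
    df.drop(columns=["Positive affect", "Negative affect"])
      .dropna()
      .reset_index(drop=True)
)
print(clean.shape)
```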

Visualisation Technique

This section gives learners a step-by-step flow for creating scatterplots, SPLOMs and bubble plots. The order is deliberate, because SPLOMs and bubble plots build on the fundamental concept of the scatterplot.

Import Seaborn visualization library

The seaborn package was developed on top of the Matplotlib library. It is used to create more attractive and informative statistical graphics. Although seaborn is a separate package, it can also be used to improve the look of plain Matplotlib graphics. Seaborn is conventionally imported under the alias sns.

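A minimal sketch of the import cell follows. The alias sns is the usual convention; the set_theme call and the "whitegrid" style are assumptions for illustration (set_theme is available from seaborn 0.11, and older releases use sns.set instead).

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply seaborn's default styling to every later plot, including
# plots drawn directly with Matplotlib.
sns.set_theme(style="whitegrid")

print(sns.__version__)
```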

What is a scatterplot?

A scatterplot compares two quantitative variables to help us judge whether there is a meaningful relationship between them, for instance whether an increase in one generally goes with an increase in the other. Each dot represents one data point, and its position along the x- and y-axes gives its values for the two numeric variables.

Assigning hue and style to variables in the dataset adds colour and marker variation, and both can be applied independently or together in just a single plot! From here we can see that most of the data points lie in the right-hand corner of the plot, which can be generalised to mean that the higher the freedom to make life choices in a country, generally, the higher the perception of corruption. We can also see that the year variable is not very useful: the markers for any given year are not concentrated in one region but scattered all over the plot, which indicates that, over time, changes in freedom to make life choices do not affect the perception of corruption in countries.
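A scatterplot with both hue and style mapped to the year might be sketched as follows. The data are synthetic stand-ins generated at random, so the plot will not show the pattern described above; the column names are assumed to match the dataset, and the "deep" palette is an illustrative choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in values (NOT real survey data); swap in the cleaned
# happiness DataFrame to reproduce the actual chart.
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "year": np.repeat([2017, 2018, 2019], n // 3),
    "Freedom to make life choices": rng.uniform(0.3, 1.0, n),
    "Perceptions of corruption": rng.uniform(0.1, 1.0, n),
})

ax = sns.scatterplot(
    data=df,
    x="Freedom to make life choices",
    y="Perceptions of corruption",
    hue="year",      # colour encodes the year
    style="year",    # marker shape encodes the year as well
    palette="deep",  # a qualitative palette keeps the years distinct
)
ax.set_title("Synthetic stand-in data")
```

Mapping hue and style to the same variable is redundant on purpose: the years stay distinguishable even when the chart is printed in greyscale.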

SPLOMs

This is a great segue to our first intended visualisation technique, the scatterplot matrix (SPLOM). As the name suggests, a SPLOM, although not commonly used, compares numerous quantitative variables pairwise to see whether there are meaningful relationships between them.

Plotting a basic SPLOM using seaborn

The pairplot is built on two charts, the histogram and the scatter plot. The histograms on the diagonal show the distribution of a single variable, while the scatter plots off the diagonal show the relationship (be it strong or weak) between two quantitative variables. However, there are too many quantitative variables here to visualise at once. Therefore, we should reduce the number of quantitative variables and keep only the ones we are most interested in.

Plotting a detailed SPLOM using seaborn

Even after reducing the number of variables, we can make the SPLOM more accessible, just as we did for scatterplots. We can colour the data points in each plot by a chosen categorical variable (in this case, the year), and this is as easy as it was for a scatterplot: we simply add the "hue" parameter to the call.

Additionally, the stacked histograms on the diagonal of the basic SPLOM are not very interpretable. We can easily replace them with density plots that show each variable's distribution by year, simply by passing "kde" to the diag_kind parameter. Let us also adjust the transparency and edge colour of the data points in all the scatterplots.

Although this SPLOM is more informative, some amendments are still needed. It remains visually unappealing because there are still too many quantitative variables to take in at a glance. Let us reduce the amount of data further by plotting only the years 2017 to 2019, still colouring by the year variable. To limit the number of variables (columns) in the SPLOM, we pass a list to the vars parameter, which lets us focus on only a few variables at a time. We will also switch to more noticeable colours to differentiate the 3 years, and use the marker size setting 's' to adjust the size of the data points for easier readability.

Advantages of SPLOMs

SPLOM as a visualisation technique is especially useful when we have a dataset with several (ideally more than 4) quantitative variables. With the entire scatterplot matrix produced in one simple line of code, we can quickly observe any meaningful relationship between each pair of quantitative variables.

In this dataset we see, for example, that Log GDP per capita has a positive correlation with social support and life ladder, and the same holds between life ladder and social support. Therefore, at a glance, we can identify meaningful relationships and relate them to theory straightaway for a more insightful analysis. SPLOM is an extremely informative visualisation technique that offers an efficient answer to the important question, "What is the relationship between each pair of quantitative variables in my dataset?" Again, we see that time does not play any role in influencing the relationship between any two quantitative variables.

Disadvantages of SPLOMs

Although SPLOM offers a quick and easy solution to the otherwise complex problem of observing correlations, it becomes drastically less effective if our dataset consists predominantly of qualitative variables. In that case, the scatterplot itself is extremely limited when comparing a nominal or ordinal variable with a quantitative one, and the technique loses its purpose because the insights drawn from such pairings are meaningless.

Another reason I wanted to use this dataset is to show that SPLOMs become less meaningful when there are too many quantitative variables. The happiness dataset initially contains 10 quantitative variables; after removing 2 (positive affect and negative affect), the SPLOM becomes less messy visually. I would say this dataset is already pushing it with 8 quantitative variables, and we can certainly improve the SPLOMs above by reducing that to 6 or 5, which is what we did! We went even further by passing vars to concentrate on a few variables at a time for more effective visualisation. A large cluster of scatter plots makes visual inspection of the data nearly impossible.

Bubbleplot

What is a bubbleplot?

Bubble plots draw on the fundamental concept of scatterplots and are quite similar to them. However, they can be more effective because they reveal insights through an additional z dimension. Plotted on an x- and y-axis, a scatterplot lets us observe any correlation between two quantitative variables. A bubble plot adds another quantitative variable into the mix and uses marker size to aid the storytelling of the visualisation. Bubble plots are easy to understand once scatterplots are familiar, which is why this tutorial began with scatterplots, and now we are going to build on that concept! In summary, bubble plots help us understand information across more than two dimensions.
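A basic bubble plot is a scatterplot with a third quantitative variable mapped to marker size, as in the sketch below. The data are synthetic stand-ins for 15 hypothetical countries, and the choice of Life Ladder for the x-axis is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in values for 15 hypothetical countries (NOT real data).
rng = np.random.default_rng(3)
countries = [f"Country {i + 1}" for i in range(15)]
gdp = rng.normal(9.4, 1.1, len(countries))
df = pd.DataFrame({
    "Country name": countries,
    "Log GDP per capita": gdp,
    "Life Ladder": 0.7 * gdp + rng.normal(-1.1, 0.4, len(countries)),
    "Healthy life expectancy at birth": 5.0 * gdp + rng.normal(18.0, 2.0, len(countries)),
})

ax = sns.scatterplot(
    data=df,
    x="Life Ladder",
    y="Healthy life expectancy at birth",
    size="Log GDP per capita",  # third variable encoded as bubble size
    alpha=0.6,                  # transparency helps where bubbles overlap
)
```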

On close examination of the bubble chart, we can see that countries with a high healthy life expectancy at birth also tend to have high log GDP per capita values. However, this chart is still quite ineffective for human perception because the log GDP bubble sizes are hard to tell apart, so we shall adjust them to make the bubble plot more meaningful and effective. At the same time, we shall use a different plotting method: instead of scatterplot, we shall use catplot. This is very similar to a swarm plot; all we need to do is call seaborn's catplot function.
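One way to make the bubbles easier to tell apart is to widen the range of marker areas through the sizes parameter. The sketch below stays with scatterplot rather than catplot, because the original catplot call is not shown; the data are synthetic stand-ins and the range (30, 600) is an illustrative assumption.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in values for 15 hypothetical countries (NOT real data).
rng = np.random.default_rng(3)
n = 15
gdp = rng.normal(9.4, 1.1, n)
df = pd.DataFrame({
    "Log GDP per capita": gdp,
    "Life Ladder": 0.7 * gdp + rng.normal(-1.1, 0.4, n),
    "Healthy life expectancy at birth": 5.0 * gdp + rng.normal(18.0, 2.0, n),
})

ax = sns.scatterplot(
    data=df,
    x="Life Ladder",
    y="Healthy life expectancy at birth",
    size="Log GDP per capita",
    sizes=(30, 600),  # smallest and largest marker area, in points squared
    alpha=0.6,
)
```

A wider sizes range stretches the same data values over a larger span of marker areas, so small differences in log GDP become easier to see, at the cost of more overlap between neighbouring bubbles.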

Advantages of using Bubbleplots

Bubble plots allow us to notice observable trends while also showing relationships between quantitative variables. In the example we plotted above, the first-world countries have higher values for healthy life expectancy at birth, and their bubble sizes show that high healthy life expectancy at birth tends to be accompanied by high log GDP per capita. This suggests a correlation between these two quantitative variables.

Disadvantages of using Bubbleplots

One of the main limitations of bubble plots is that their effectiveness becomes very limited as the number of values in a categorical variable (e.g. the number of countries) increases. This is why we began by selecting only 15 countries before putting them into visual form. Even then, as we can see, some of the bubbles overlap each other, and the clutter makes it harder to draw insights.

Conclusion

I hope this tutorial has been useful. It began by showing how to get set up in the Python learning environment, then covered the Seaborn visualisation library, when to use it, how to install it, approaches to data cleaning, and the scatterplot, SPLOM and bubble plot methods. It also recognised that there are other ways to achieve the same charts and, most importantly, discussed when each visualisation technique works best.

This notebook demonstrates:

Rule 2: Document the process, not just the results. This tutorial has documented the detailed steps to arrive at each scatterplot, SPLOM and bubble plot. Each line of code has been explained beforehand to guide the reader closely through the same coding steps. Other intuitive and/or repeated code has also been supported with comments (#) in the code snippets. This workflow has helped greatly in avoiding repeated documentation for the same kinds of steps, which has helped organise the tutorial. Different methods and enhancements are used across all three visualisation techniques, and each visualisation step is layered on top of the previous one in an easily digestible fashion. The tutorial is laid out step by step, allowing learners to understand the different variations used.

Rule 3: Use cell divisions to make steps clear. Each code cell has been designed to accomplish the task of creating one visualisation. The cells are kept short, and the comments are organised so as to avoid messy code. Each code cell is also preceded by a markdown cell describing the visualisation to be implemented. A rough table of contents is introduced at the start of the notebook for easy reference and guidance.

Rule 5: Record dependencies. At the beginning of the notebook, Python and Seaborn library versions are specified.

Rule 9: Design your notebooks to be read, run, and explored. The notebook is uploaded to GitHub for others to access. In addition, a static HTML version of the notebook is provided to support readability. You may visit https://github.com/Michwynn/MADS
