DATA VISUALIZATION-DISTRIBUTION PLOT

This article takes a comprehensive look at distribution plots in Python, using the matplotlib and seaborn libraries.

Histograms

A great way to get started exploring a single variable is with the histogram. A histogram divides the variable into bins, counts the data points in each bin, and shows the bins on the x-axis and the counts on the y-axis. 

To make a basic histogram in Python, we can use either matplotlib or seaborn. The code below shows function calls in both libraries that create equivalent figures. Both calls set the bin width indirectly, through the number of bins. For this plot, I will use bins that are 5 minutes wide, so the number of bins is the range of the data (from -60 to 120 minutes, a span of 180 minutes) divided by the bin width of 5 minutes: bins = int(180/5), which is 36.

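import matplotlib.pyplot as plt
import seaborn as sns

# flight is assumed to be a pandas DataFrame with an 'arr_delay' column (in minutes)

# matplotlib histogram
plt.hist(flight['arr_delay'], bins=int(180/5), color='blue', edgecolor='black')

# seaborn histogram (distplot is deprecated since seaborn 0.11; sns.histplot is its replacement)
sns.distplot(flight['arr_delay'], bins=int(180/5), kde=False,
             hist=True, hist_kws={'edgecolor': 'black'},
             color='blue')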
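
To make the binning arithmetic concrete, here is a minimal, dependency-free sketch of the counting step a histogram performs. It assumes the range is fixed at -60 to 120 minutes, as in the text. The function name histogram_counts and the sample delays are invented for illustration and are not part of matplotlib or seaborn, which handle edge cases more carefully.

```python
def histogram_counts(values, low, high, binwidth):
    """Count how many values fall into each fixed-width bin on [low, high]."""
    n_bins = int((high - low) / binwidth)  # same arithmetic as int(180 / 5)
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` goes into the last bin (right edge closed).
            idx = min(int((v - low) // binwidth), n_bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical arrival delays in minutes; 500 is outside the range and is skipped.
sample_delays = [-60, -58, -55, 0, 3, 4.9, 119, 120, 500]
counts = histogram_counts(sample_delays, low=-60, high=120, binwidth=5)

print(len(counts))  # number of bins
print(counts[12])   # observations in the bin that starts at 0 minutes
```

Each entry of counts is the height of one bar; the plotting libraries do this counting and then draw the bars.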

Kernel density estimation

The kernel density estimate (KDE) may be less familiar, but it is a useful tool for plotting the shape of a distribution. Like a histogram, a KDE plot encodes the density of observations on one axis with height along the other axis:

# seaborn density curve: hist=False turns the histogram off, so bins and hist_kws are not needed
sns.distplot(flight['arr_delay'], kde=True, hist=False, color='blue')
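
As a rough illustration of what a kernel density estimate computes, the sketch below places a Gaussian bump on every observation and averages the bumps. It uses only the standard library. The helper name gaussian_kde, the hand-picked bandwidth of 5 minutes, and the sample delays are assumptions for this example; seaborn chooses a bandwidth automatically and its implementation differs in detail.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian kernel per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical arrival delays in minutes.
sample_delays = [-10, -5, 0, 0, 5, 30]
density = gaussian_kde(sample_delays, bandwidth=5.0)

# The curve is highest where observations cluster and lower around the isolated point.
print(density(0) > density(30))
```

A smaller bandwidth produces a bumpier curve that follows individual points; a larger one gives a smoother curve that may hide real structure.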

Scatterplots

The most familiar way to visualize a bivariate distribution is a scatterplot, where each observation is shown as a point at its x and y values. This is analogous to a rug plot in two dimensions. You can draw a scatterplot with the matplotlib plt.scatter function, and it is also the default kind of plot shown by the jointplot() function:

# seaborn: jointplot draws the scatterplot plus the marginal distribution of each variable
sns.jointplot(x='arr_time', y='arr_delay', data=flight, kind='scatter')

# matplotlib: the scatterplot alone
plt.scatter(x='arr_time', y='arr_delay', data=flight)

Pairplot

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal:

#seaborn

sns.pairplot(data=flight)

# Signature for reference. seaborn 0.9 renamed size to height, and some defaults differ between releases.
sns.pairplot(data, hue=None, hue_order=None, palette=None, vars=None, x_vars=None, y_vars=None, kind='scatter', diag_kind='hist', markers=None, height=2.5, aspect=1, dropna=True, plot_kws=None, diag_kws=None, grid_kws=None)
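
The "matrix of axes" can be pictured with a small sketch that only works out which variables land in which cell; it draws nothing. The function pairplot_layout is a made-up helper, and the column 'dep_delay' is assumed for illustration (only 'arr_time' and 'arr_delay' appear in this article).

```python
from itertools import product


def pairplot_layout(columns):
    """Map each (row, col) grid cell to (x variable, y variable, plot kind)."""
    layout = {}
    for row, col in product(range(len(columns)), repeat=2):
        x_var, y_var = columns[col], columns[row]
        # Diagonal cells pair a variable with itself, so they show its own distribution.
        kind = "univariate" if row == col else "bivariate"
        layout[(row, col)] = (x_var, y_var, kind)
    return layout


# 'dep_delay' is a hypothetical third column.
layout = pairplot_layout(["arr_time", "arr_delay", "dep_delay"])

for cell, (x_var, y_var, kind) in sorted(layout.items()):
    print(cell, x_var, y_var, kind)
```

With n numeric columns the grid has n × n cells, so the figure grows quickly; restricting the columns with the vars parameter keeps it readable.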


Rugplot

A rugplot draws a short tick along an axis for each observation in a dataset. Where observations are dense, the ticks crowd together; where observations are sparse, only occasional ticks appear. This is the essence of a rugplot.

#seaborn

sns.rugplot(flight['arr_delay'])
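
The idea can be mimicked in plain text, which shows how each observation maps to a position along the axis. This is only a sketch: text_rug is an invented helper, the delays are made up, and nearby values collapse into a single character here, whereas in a real rugplot they overlap visually.

```python
def text_rug(values, low, high, width=36):
    """Return a one-line text 'rug': a '|' wherever at least one value falls."""
    cells = [" "] * width
    for v in values:
        if low <= v <= high:
            # Scale the value to a column index; `high` maps to the last column.
            pos = min(int((v - low) * width // (high - low)), width - 1)
            cells[pos] = "|"
    return "".join(cells)


# Hypothetical arrival delays in minutes.
rug = text_rug([-60, 0, 1, 2, 120], low=-60, high=120, width=36)
print("[" + rug + "]")
```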



More articles by Snehanshu Sengupta
