Descriptive Statistics

“I couldn’t claim that I was smarter than sixty-five other guys — but the average of sixty-five other guys, certainly!” ― Richard P. Feynman

Descriptive analytics is about finding out “what has happened” by summarizing past data and analyzing it with simple queries. Descriptive statistics is distinguished from inferential statistics (or inductive statistics) in that it aims to summarize a sample, rather than use the data to learn about the population that the sample is thought to represent.

Table of Contents

  1. Introduction to Descriptive Statistics
  2. Data Types and Scales
  3. Types of Data Measurement Scales
  4. Population and Sample
  5. Measure of Central Tendency
  6. Measure of Variance
  7. Measure of Shape

Introduction

Descriptive analytics is the starting point of any analytics-based solution to a problem. It helps in understanding the data and lays the groundwork for predictive and prescriptive analytics. Data science professionals therefore need to run descriptive statistics and draw out insights before doing further analysis.

For example, a retailer such as Target or Reliance would like to know the top 5 products sold by region, by city, by store, and so on.

Data Types and Scales

Data is classified into different categories based on data structure and scale of measurement of the variables.

Structured and Unstructured Data

Structured Data : Data that is organized in rows and columns, such as a table in a formatted database.

Unstructured Data (or unstructured information) : Any data that does not originally exist in matrix form, that is, in rows and columns.

Examples of unstructured data include clickstream data, e-mails, feedback data, survey data, etc.

One of the main challenges in dealing with unstructured data is converting it to a structured form for further analysis and model development.

Cross-Sectional, Time Series and Panel Data

Cross-Sectional Data : Data collected on many variables at the same point in time or over the same period.

For example: Closing prices of 30 different tech stocks on 16th August 2019

Time Series Data : Data collected on a single variable over several time intervals.

For example: Monthly sales of soap over a 5-year period

Panel Data : Data collected on several variables over several time intervals. It is also called longitudinal data.

For example: Unemployment rates of different countries collected over several years.
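To make the difference between the three structures concrete, here is a minimal Python sketch. The tickers, countries and numbers are invented purely for illustration and are not real data.

```python
# Hypothetical toy data, invented purely for illustration.

# Cross-sectional: several entities observed at one point in time.
cross_sectional = {"AAA": 101.5, "BBB": 54.2, "CCC": 230.0}  # closing prices on one day

# Time series: one variable observed over successive periods.
time_series = [("2019-01", 120), ("2019-02", 135), ("2019-03", 128)]  # monthly sales

# Panel: several entities, each observed over several periods.
panel = {
    ("Country A", 2017): 5.1,
    ("Country A", 2018): 4.8,
    ("Country B", 2017): 7.3,
    ("Country B", 2018): 7.0,
}

# Fixing the year in a panel gives a cross-section ...
cross_section_2018 = {c: v for (c, y), v in panel.items() if y == 2018}

# ... while fixing the entity gives a time series.
series_country_a = [(y, v) for (c, y), v in sorted(panel.items()) if c == "Country A"]

print(cross_section_2018)
print(series_country_a)
```

In practice a library such as pandas would usually hold this kind of data, but plain dictionaries are enough to show the shapes.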


Types of Data Measurement Scales

Structured data can be either numeric or alphanumeric and may follow different scales of measurement. It is important to know the measurement scale of each variable, because the scale determines which summary statistics are meaningful.

Nominal Scale : Nominal scale measurement deals with non-numeric (qualitative) variables, or with variables where the numbers are only labels and have no quantitative value.

Ordinal Scale : Ordinal scale reports the ranking and ordering of the data without actually establishing the degree of variation between them.

Interval Scale : The interval scale is a quantitative measurement scale where the difference between two values is meaningful. An easy way to remember the interval scale is that subtraction is defined between two values, but the zero point is arbitrary, so ratios are not meaningful.

Interval data can be discrete, with whole numbers such as 8 degrees Celsius, or continuous, with fractional numbers such as 12.2 degrees Celsius.

Ratio Scale : The ratio scale is a quantitative measurement scale. Like the interval scale, it allows differences between values to be compared, but it also possesses a true zero point (origin), so ratios between values are meaningful.

A ratio scale is the most informative scale, as it conveys the order of values, the size of the differences between them, and how many times larger one value is than another.

By contrast, temperature in degrees Celsius is not on a ratio scale. For example, if the temperature outside is 0 degrees Celsius, that 0 does not mean an absence of temperature; it is just another value on the scale.
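The distinction between interval and ratio scales can be checked with a little arithmetic. The sketch below uses two made-up temperature readings: differences in Celsius are meaningful, but the ratio only becomes meaningful after converting to Kelvin, which has a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale Celsius reading to ratio-scale Kelvin."""
    return celsius + 273.15

# Two illustrative readings (assumed values, not measurements).
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# A ratio of Celsius values is just an artifact of where zero was placed.
naive_ratio = t2_c / t1_c

# On the Kelvin (ratio) scale the ratio has a physical meaning.
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(f"difference: {diff_c:.1f} C = {diff_k:.1f} K")
print(f"Celsius ratio: {naive_ratio:.2f}, Kelvin ratio: {true_ratio:.4f}")
```

So 20 degrees Celsius is 10 degrees warmer than 10 degrees Celsius, but it is not "twice as hot".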

Population and Sample

Population : The set of all possible observations for a given problem context. The size of the population can be very large in many cases.

For example: All the members of a mathematical society.

Or: All daily maximum temperatures in July for major U.S. cities.

Sample : A sample is a smaller group of members selected from a population to represent it. Incorrect sampling may result in bias and incorrect inferences about the population.
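As a rough illustration, the sketch below draws a simple random sample from a simulated population and compares the two means. The population itself (1,000 values around 30 with a spread of 5) is an assumption made up for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 simulated daily maximum temperatures.
population = [random.gauss(30, 5) for _ in range(1000)]

# Simple random sample of 50 members, drawn without replacement.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With random sampling the sample mean should land close to the population mean. A biased selection, such as taking only the 50 largest values, would not.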

Measure of Central Tendency

Measures of central tendency describe a dataset using a single value.

Mean, median, and mode are the three measures of central tendency.

Mean

The mean is the arithmetic average of the data. Assume that a sample has n observations, and let X_i be the value of the i-th observation.

The mean of the n observations is then given by:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
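The mean is simply the sum of the observations divided by their count. Here is a minimal Python sketch on a small made-up dataset, computed both by hand and with the standard library's statistics module.

```python
import statistics

data = [12, 15, 11, 18, 14]  # illustrative values

# Sum of the observations divided by the number of observations.
n = len(data)
mean_manual = sum(data) / n

# The standard library gives the same result.
mean_builtin = statistics.mean(data)

print(mean_manual, mean_builtin)
```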

Median (or Mid Value)

The median is the value that divides the sorted data into two equal parts, so that 50% of the data lies below the median and 50% lies above it.
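A short sketch with made-up numbers shows how the median behaves for odd and even sample sizes, and why it is less sensitive to extreme values than the mean.

```python
import statistics

odd = [7, 3, 9, 1, 5]   # sorted: 1 3 5 7 9, middle value is 5
even = [7, 3, 9, 1]     # sorted: 1 3 7 9, average of the two middle values

median_odd = statistics.median(odd)
median_even = statistics.median(even)

# One extreme value moves the mean a lot but leaves the median unchanged.
with_outlier = [1, 3, 5, 7, 900]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(median_odd, median_even)
print(mean_outlier, median_outlier)
```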

Mode

The mode is the most frequently occurring value in a dataset.
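Unlike the mean and median, the mode also works for non-numeric data. A small sketch with invented values, using the standard library (statistics.multimode needs Python 3.8 or later):

```python
import statistics

# The mode is defined for categorical data too.
shirt_sizes = ["M", "L", "M", "S", "M", "L"]
most_common_size = statistics.mode(shirt_sizes)

# A dataset can have more than one mode; multimode returns all of them.
scores = [1, 1, 2, 2, 3]
all_modes = statistics.multimode(scores)

print(most_common_size, all_modes)
```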

Measure of Variance

One of the primary objectives of data analytics is to understand the variability in data. Variability is measured using the following measures:

  1. Range
  2. Inter-Quartile Distance (IQD)
  3. Variance
  4. Standard Deviation

Range

It is the difference between the maximum and minimum values of the data. It measures the spread of the data, but because it depends only on the two extreme values it is sensitive to outliers.

Inter-Quartile Distance

Inter-Quartile Distance (IQD), also called Inter-Quartile Range, measures the distance between Quartile 1 (Q1) and Quartile 3 (Q3), that is, IQD = Q3 - Q1. IQD is a useful measure for identifying outliers in data.

Values below Q1 - 1.5 × IQD or above Q3 + 1.5 × IQD are classified as outliers.
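The range and the IQD-based outlier rule can be sketched as follows. The data is made up, and the quartiles use the "inclusive" method of statistics.quantiles (Python 3.8+). Other quartile conventions exist and can give slightly different fences.

```python
import statistics

data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 40]  # illustrative values; 40 looks suspicious

# Range: maximum minus minimum, dominated here by the single large value.
data_range = max(data) - min(data)

# Quartiles; the interpolation method is an assumption of this sketch.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqd = q3 - q1

# Fences at 1.5 x IQD beyond the quartiles.
lower_fence = q1 - 1.5 * iqd
upper_fence = q3 + 1.5 * iqd
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"range={data_range}, Q1={q1}, Q3={q3}, IQD={iqd}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Note how the range is driven almost entirely by the one extreme value, while the IQD describes the spread of the middle half of the data.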

Variance

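Variance measures the variability of the data around the mean value. For a population of N observations with mean μ, the population variance σ² is given by:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$$

For a sample of n observations with mean $\bar{X}$, the sample variance divides by n - 1 instead:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$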

Standard Deviation

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.

For a population of N observations with mean μ:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}$$
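Here is a minimal sketch of variance and standard deviation on a made-up dataset. It computes the population versions by hand (dividing by the number of observations) and then uses the standard library for both the population and the sample versions (which divide by n - 1).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
n = len(data)
mu = sum(data) / n

# Population variance: mean squared deviation from the mean.
pop_var_manual = sum((x - mu) ** 2 for x in data) / n
pop_sd_manual = math.sqrt(pop_var_manual)

# Standard-library equivalents.
pop_var = statistics.pvariance(data)   # divides by n
pop_sd = statistics.pstdev(data)
sample_var = statistics.variance(data)  # divides by n - 1
sample_sd = statistics.stdev(data)

print(pop_var_manual, pop_sd_manual)
print(pop_var, pop_sd, sample_var, sample_sd)
```

The sample versions are slightly larger than the population versions because of the smaller divisor.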

Chebyshev’s Theorem

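Chebyshev’s theorem (also called Chebyshev’s inequality) gives a lower bound on the proportion of observations that lie within an interval defined using the mean and standard deviation. Unlike the empirical rule, which assumes a normal distribution, it holds for any distribution with a finite mean and variance. For any k > 1, at least 1 - 1/k² of the observations lie within k standard deviations of the mean:

$$P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}$$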
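Chebyshev's bound says that at least 1 - 1/k² of the observations lie within k standard deviations of the mean, for any k > 1. The sketch below compares that guaranteed minimum with the proportion actually observed in a small made-up dataset. The helper names are hypothetical and chosen just for this example.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

def proportion_within(data, k):
    """Observed proportion of values strictly within k population SDs of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) < k * sigma]
    return len(inside) / len(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

for k in (1.5, 2, 3):
    bound = chebyshev_lower_bound(k)
    observed = proportion_within(data, k)
    print(f"k={k}: guaranteed >= {bound:.3f}, observed {observed:.3f}")
```

The observed proportion is usually well above the bound, because the inequality has to hold for every possible distribution.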

Measure of Shape — Skewness and Kurtosis

Skewness : Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. It can be quantified to measure how far a distribution departs from symmetry; a symmetric distribution such as the normal distribution has zero skewness.
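One common way to quantify skewness is the moment-based coefficient, the third central moment divided by the second central moment raised to the power 3/2. The sketch below assumes that definition (other definitions, including bias-corrected ones, exist) and uses made-up data.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2**1.5, using population central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 10]      # long tail to the right
left_skewed = [-x for x in right_skewed]   # mirror image, long tail to the left

print(skewness(symmetric))     # zero for symmetric data
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```

A positive value indicates a longer right tail, and a negative value a longer left tail.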

Kurtosis : Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Excess kurtosis compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula below:

Excess Kurtosis = Kurtosis - 3

Types of Kurtosis

1. Mesokurtic

Data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero. The normal distribution is mesokurtic.

2. Leptokurtic

A leptokurtic distribution has positive excess kurtosis. It shows heavy tails on either side, indicating that large outliers are more likely than under a normal distribution.

3. Platykurtic

A platykurtic distribution has negative excess kurtosis. Its tails are thinner than those of a normal distribution, indicating that outliers are fewer and less extreme.
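The three types can be told apart by computing excess kurtosis from the data. The sketch below uses the moment-based definition (fourth central moment over the squared second central moment, minus 3). The tolerance used to call a value "close to zero" is an arbitrary assumption for this example, and the datasets are made up.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3, using population moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

def classify_kurtosis(value, tolerance=0.5):
    """Label a distribution by excess kurtosis; the tolerance is an arbitrary choice."""
    if value > tolerance:
        return "leptokurtic"
    if value < -tolerance:
        return "platykurtic"
    return "mesokurtic"

flat = [1, 2, 3, 4, 5]                          # evenly spread, thin tails
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 1, 10]  # mostly central, a few extremes

for name, values in (("flat", flat), ("heavy_tailed", heavy_tailed)):
    ek = excess_kurtosis(values)
    print(f"{name}: excess kurtosis {ek:.2f} -> {classify_kurtosis(ek)}")
```

With very small samples like these the estimate is rough, so the labels are only indicative.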

Summary

This post gave an introduction to descriptive statistics. You learned what descriptive statistics is and why it is important in any data science or analytical solution. You also learned about data types and their measurement scales. Then we discussed measures of central tendency, variance and shape. Lastly, you learned about leptokurtic, mesokurtic and platykurtic distributions and about calculating each summary statistic in Python.

More articles by Somya Rai