Descriptive Statistics
“I couldn’t claim that I was smarter than sixty-five other guys — but the average of sixty-five other guys, certainly!” ― Richard P. Feynman
Descriptive analytics is about finding out “what has happened” by summarizing past data with summary measures and simple queries. It is distinguished from inferential statistics (or inductive statistics) in that descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample is thought to represent.
Introduction
Descriptive analytics is the starting point of any analytics-based solution to a problem. It helps in understanding the data and lays the groundwork for predictive and prescriptive analytics. Data science professionals need to run descriptive statistics and draw out insights before doing further analysis.
For example, a retailer such as Target or Reliance would like to know the top 5 products sold by region, by city, by store, etc.
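The retailer question above can be sketched as a simple query in Python. The sales records, region names, and the helper `top_products_by_region` are hypothetical and exist only for illustration; a real retailer would run the same kind of aggregation against its own transaction data.

```python
from collections import Counter, defaultdict

# Hypothetical sales records: (region, product, units_sold).
sales = [
    ("North", "soap", 120),
    ("North", "shampoo", 80),
    ("North", "soap", 40),
    ("South", "rice", 200),
    ("South", "soap", 50),
    ("South", "rice", 30),
]


def top_products_by_region(records, k=5):
    """Return {region: [(product, total_units), ...]} with the k best sellers."""
    totals = defaultdict(Counter)
    for region, product, units in records:
        totals[region][product] += units  # accumulate units per product
    return {region: counts.most_common(k) for region, counts in totals.items()}


top = top_products_by_region(sales, k=2)
for region, products in top.items():
    print(region, products)
```

The same grouping idea extends to city or store level by changing the key used for the outer dictionary.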
Data Types and Scale
Data is classified into different categories based on data structure and scale of measurement of the variables.
Structured and Unstructured Data
Structured Data : Data that is originally organized in rows and columns, or in a formatted database.
Unstructured data (or unstructured information) : Any data that does not originally exist in matrix form, that is, in rows and columns.
One of the main challenges in dealing with unstructured data is converting it to a structured form for further analysis and model development.
Examples of unstructured data include clickstream data, e-mails, feedback data, survey data, etc.
Cross-Sectional, Time Series and Panel Data
Cross-Sectional Data : Data collected on many variables at the same point in time or over the same period.
For example: Closing prices of 30 different tech stocks on 16th August 2019.
Time Series Data : Data collected for a single variable over several time intervals.
For example: Monthly sales of soaps over a 5-year period.
Panel Data : Data collected on several variables over several time intervals. It is also called longitudinal data.
For example: Unemployment rate of different countries collected over several years.
Type of Data Measurement Scale
Structured data can be either numeric or alphanumeric and may follow different scales of measurement. It is important to understand the measurement scale of each variable in order to choose appropriate summary statistics and analysis methods.
Nominal Scale : A nominal scale deals with non-numeric (qualitative) variables, or with variables where numbers serve only as labels and have no numeric value.
Ordinal Scale : Ordinal scale reports the ranking and ordering of the data without actually establishing the degree of variation between them.
Interval Scale : The interval scale is a quantitative measurement scale in which the difference between two values is meaningful. An easy way to remember the interval scale is that subtraction is defined between any two values.
Interval data can be discrete, with whole numbers such as 8 degrees, or continuous, with fractional numbers such as 12.2 degrees.
Ratio Scale : The ratio scale is a quantitative measurement scale that, like the interval scale, allows differences to be compared, and that also possesses a true zero point (point of origin), so ratios between values are meaningful.
The ratio scale is the most informative scale, since it conveys the order of the values, the differences between them, and their ratios.
By contrast, temperature in Celsius is on an interval scale: 0 degrees Celsius does not mean that there is no temperature; it is simply a value on the scale.
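The difference between interval and ratio scales can be illustrated with a little arithmetic. This is a minimal sketch with made-up weights and temperatures; the helper `celsius_to_kelvin` is introduced here only for the example.

```python
# Ratio scale: weight in kilograms has a true zero, so ratios are meaningful.
weight_a_kg, weight_b_kg = 20.0, 10.0
weight_ratio = weight_a_kg / weight_b_kg  # a is twice as heavy as b

# Interval scale: Celsius has an arbitrary zero, so only differences are meaningful.
temp_a_c, temp_b_c = 20.0, 10.0
temp_difference = temp_a_c - temp_b_c  # a 10-degree difference is meaningful
naive_ratio = temp_a_c / temp_b_c      # 2.0, but 20 C is NOT "twice as hot" as 10 C


def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale with a true zero (ratio scale)."""
    return celsius + 273.15


# On the Kelvin scale the ratio is physically meaningful, and it is close to 1.
kelvin_ratio = celsius_to_kelvin(temp_a_c) / celsius_to_kelvin(temp_b_c)

print(weight_ratio, temp_difference, naive_ratio, round(kelvin_ratio, 4))
```

The naive Celsius ratio and the Kelvin ratio disagree, which is why ratios should only be computed for ratio-scale variables.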
Population and Sample
Population : The set of all possible observations for a given problem context. The population can be very large in many cases.
For example: All the members of a mathematical society.
Or, all daily maximum temperatures in July for major U.S. cities.
Sample : A sample is a smaller group of members of a population selected to represent the population. Incorrect sampling may result in bias and incorrect inferences about the population.
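A simple random sample can be drawn with Python's standard library. In this sketch the population is simulated (hypothetical daily maximum temperatures), and the seed, population size, and sample size are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated daily maximum temperatures (Celsius).
population = [random.gauss(30, 5) for _ in range(10_000)]

# Simple random sample without replacement: every member is equally likely.
sample = random.sample(population, k=100)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a random sample the two means are typically close. A biased selection, for example taking only the 100 hottest days, would give a sample mean far from the population mean.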
Measure of Central Tendency
Measures of central tendency describe a data set using a single value.
Mean, median, and mode are the three measures of central tendency.
Mean
Mean is the arithmetic average of the data. Assume the sample has n observations, and let X_i be the value of the i-th observation.
Therefore, the mean is given as:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Median (or Mid Value)
Median is the value that divides the data into two equal parts; that is, 50% of the data lies below the median and 50% lies above it.
Mode
Mode is the most frequently occurring value in a dataset.
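The three measures can be computed with Python's built-in `statistics` module. The data values below are made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical observations

mean = statistics.mean(data)      # sum of values / number of values = 30 / 6
median = statistics.median(data)  # average of the two middle values, 3 and 5
mode = statistics.mode(data)      # most frequent value, 3

# The mean is sensitive to extreme values; the median is much less so.
with_outlier = data + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

# multimode returns every value that ties for most frequent.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean, median, mode)
print(round(mean_with_outlier, 2), median_with_outlier, modes)
```

Adding a single extreme value moves the mean from 5 to about 18.57, while the median only moves from 4 to 5, which is why the median is often preferred for skewed data.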
Measure of Variance
One of the primary objectives of data analytics is to understand the variability in data. Variability in data is measured using the following measures:
Range
It is the difference between the maximum and minimum values of the data. It measures the spread of the data.
Inter-Quartile Distance
Inter-Quartile Distance (IQD), also called Inter-Quartile Range, measures the distance between Quartile 1 (Q1) and Quartile 3 (Q3), that is, IQD = Q3 - Q1. IQD is a useful measure for identifying outliers in data.
Values below Q1 - 1.5 × IQD or above Q3 + 1.5 × IQD are classified as outliers.
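The range and the IQD outlier rule can be sketched as follows. The data is made up, and the quartiles are computed with `statistics.quantiles` using `method="inclusive"`; other quartile conventions exist and can give slightly different Q1 and Q3 values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data with one extreme value

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqd = q3 - q1

lower_fence = q1 - 1.5 * iqd
upper_fence = q3 + 1.5 * iqd
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"range={data_range}, Q1={q1}, Q3={q3}, IQD={iqd}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Here the single value 50 inflates the range to 49, while the IQD stays at 4.5 and the fence rule flags 50 as an outlier.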
Variance
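Variance measures the variability of the data around the mean value, as the average squared deviation from the mean. For a population of N observations with mean μ, the population variance σ² is given by:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$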
Standard Deviation
The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
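Variance and standard deviation can be computed by hand or with the `statistics` module. The data below is made up. Note that the sample variance divides by n - 1 rather than N; this is the usual convention when the data is a sample rather than the whole population, and it is the one `statistics.variance` follows.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Population variance by hand: average squared deviation from the mean.
mu = sum(data) / len(data)
manual_pop_var = sum((x - mu) ** 2 for x in data) / len(data)

# The same quantities from the standard library.
pop_var = statistics.pvariance(data)  # divides by N
pop_std = statistics.pstdev(data)     # square root of the population variance
sample_var = statistics.variance(data)  # divides by n - 1
sample_std = statistics.stdev(data)

print(manual_pop_var, pop_var, pop_std)
print(round(sample_var, 4), round(sample_std, 4))
```

For this data the mean is 5, the population variance is 4, and the population standard deviation is 2; the sample variance is slightly larger at 32/7.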
Chebyshev’s Theorem
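Chebyshev’s theorem (also called Chebyshev’s inequality) gives a lower bound on the proportion of observations that lie within an interval defined using the mean and standard deviation. Unlike the empirical rule, which applies to approximately normal data, it holds for any distribution with a finite mean and variance: for any k > 1, at least 1 - 1/k² of the observations lie within k standard deviations of the mean.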
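The bound can be checked numerically on a small, deliberately skewed data set. This is a sketch under stated assumptions: the data is made up, the helper names `chebyshev_lower_bound` and `proportion_within` are introduced only for this example, and the population standard deviation is used.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def proportion_within(data, k):
    """Observed proportion of values strictly within k standard deviations."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    inside = sum(1 for x in data if abs(x - mu) < k * sigma)
    return inside / len(data)


data = [2, 4, 4, 4, 5, 5, 7, 9, 30]  # hypothetical, right-skewed data

for k in (1.5, 2, 3):
    bound = chebyshev_lower_bound(k)
    observed = proportion_within(data, k)
    print(f"k={k}: at least {bound:.3f}, observed {observed:.3f}")
```

The observed proportion is never below the bound, although for well-behaved data it is usually far above it; the bound is conservative because it makes no assumption about the shape of the distribution.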
Measure of Shape — Skewness and Kurtosis
Skewness : Skewness is the asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a symmetric distribution such as the normal distribution.
Kurtosis : Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
Excess kurtosis is a metric that compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula below:
Excess Kurtosis = Kurtosis - 3
Types of Kurtosis
1. Mesokurtic
Data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero. This means that data following a normal distribution is mesokurtic.
2. Leptokurtic
A leptokurtic distribution has positive excess kurtosis. It shows heavy tails on either side, indicating large outliers.
3. Platykurtic
A platykurtic distribution shows negative excess kurtosis. Its tails are thinner than those of a normal distribution, which indicates fewer and less extreme outliers.
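Skewness and excess kurtosis can be computed from central moments. The sketch below uses the simple moment-based (population) formulas, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2; statistical packages often apply bias corrections, so their results can differ slightly. The function names and the tolerance used to call a distribution mesokurtic are assumptions made for illustration.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, the kurtosis of a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


def kurtosis_type(excess, tolerance=0.5):
    """Label a distribution; the tolerance is an arbitrary illustrative choice."""
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"


symmetric = [1, 2, 3, 4, 5]      # hypothetical, evenly spread data
right_skewed = [1, 1, 1, 1, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(kurtosis_type(excess_kurtosis(symmetric)))
print(skewness(right_skewed))
```

The evenly spread data has zero skewness and an excess kurtosis of about -1.3, so it is labeled platykurtic; the second data set has positive skewness because of its long right tail.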
Summary
This post gave an introduction to descriptive statistics. You learned what descriptive statistics is and why it is important in any data science or analytical solution. You also learned about data types and their measurement scales. Then we discussed measures of central tendency, variation, and shape. Lastly, you learned about leptokurtic, mesokurtic, and platykurtic distributions and about calculating each summary statistic in Python.