Descriptive Statistics

Somya Rai

Published May 7, 2022

“I couldn’t claim that I was smarter than sixty-five other guys — but the average of sixty-five other guys, certainly!” ― Richard P. Feynman

Descriptive analytics is about finding out “what has happened” by summarizing past data and analyzing it with simple queries. It is distinguished from inferential statistics (or inductive statistics) in that descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample is thought to represent.

Introduction to Descriptive Statistics
Data Types and Scales
Types of Data Measurement Scales
Population and Sample
Measure of Central Tendency
Measure Of Variance
Measure of Shape

Introduction

Descriptive analytics is the starting point of any analytics-based solution to a problem. It helps in understanding the data and provides the foundation for predictive and prescriptive analytics. It is essential for data science professionals to run descriptive statistics and draw out insights before doing further analysis.

For example: a retailer such as Target or Reliance would like to know the top 5 products sold by region, by city, by store, etc.

Data Types and Scale

Data is classified into different categories based on its structure and on the scale of measurement of the variables.

Structured and Unstructured Data

Structured Data : Data that is organized in the form of rows and columns, or in a formatted database.

Unstructured data (or unstructured information) : Any data that does not originally exist in matrix form, that is, as rows and columns.

One of the main challenges in dealing with unstructured data is converting it to a structured form for further analysis and model development.

Examples of unstructured data: clickstream data, e-mails, feedback data, survey data, etc.

Cross-Sectional, Time Series and Panel Data

Cross-Sectional Data : Data collected on many variables at the same point in time or during the same time period.

For example: Closing prices of 30 different tech stocks on 16th August 2019.

Time Series Data : Data collected for a single variable over several time intervals.

For example: Monthly sales of soaps over a 5-year period.

Panel Data : Data collected on several variables over several intervals of time. It is also called longitudinal data.

For example: Unemployment rates of different countries collected over several years.

Type of Data Measurement Scale

Structured data can be either numeric or alphanumeric and may follow different scales of measurement. It is important to understand the data type of a variable in order to identify the correct measurement scale.

Nominal Scale : A nominal scale normally deals only with non-numeric (qualitative) variables, or with numbers that serve purely as labels and have no numeric value.

Ordinal Scale : Ordinal scale reports the ranking and ordering of the data without actually establishing the degree of variation between them.

Interval Scale : The interval scale is a quantitative measurement scale where the difference between two values is meaningful. An easy way to remember the interval scale is that subtraction is defined between two values on it.

Interval data can be discrete, with whole numbers such as 8 degrees Celsius, or continuous, with fractional values such as 12.2 degrees Celsius.

Ratio Scale : The ratio scale is a quantitative measurement scale. Like the interval scale, it allows a researcher to compare intervals or differences, and in addition it possesses a true zero point (a meaningful origin).

A ratio scale is the most informative scale: it conveys the order of values, the exact differences between them, and, because of the true zero, meaningful ratios between them.

By contrast, temperature in degrees Celsius is measured on an interval scale, not a ratio scale: 0 degrees Celsius does not mean the absence of temperature; it is simply a value on the scale.
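A small sketch may make the interval versus ratio distinction concrete. The two readings below are hypothetical, and `to_kelvin` is a helper introduced only for this illustration: differences are meaningful on the Celsius scale, but ratios only become meaningful after converting to a scale with a true zero, such as kelvin.

```python
# Hypothetical temperature readings, chosen only for illustration.
celsius_a, celsius_b = 10.0, 20.0

def to_kelvin(celsius):
    """Convert degrees Celsius to kelvin, a scale with a true zero."""
    return celsius + 273.15

# Subtraction is meaningful on an interval scale.
difference = celsius_b - celsius_a

# A ratio taken on the Celsius scale suggests "twice as hot",
# which has no physical meaning because 0 degrees Celsius is not a true zero.
naive_ratio = celsius_b / celsius_a

# The same ratio on the kelvin scale is meaningful, and much closer to 1.
kelvin_ratio = to_kelvin(celsius_b) / to_kelvin(celsius_a)

print("difference:", difference)
print("naive Celsius ratio:", naive_ratio)
print("kelvin ratio:", round(kelvin_ratio, 3))
```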

Population and Sample

Population : The set of all possible observations for a given problem context. The size of the population can be very large in many cases.

For example: All the members of a mathematical society.

Or, all daily maximum temperatures in July for major U.S. cities.

Sample : A sample is a smaller group of members of a population selected to represent the population. Incorrect sampling may result in bias and incorrect inferences about the population.
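The effect of sampling can be sketched with Python's standard library. Everything here is an assumption made for illustration: the population is simulated rather than real temperature data, and the seed, sizes, and variable names are arbitrary. The sketch compares a simple random sample with a deliberately biased one.

```python
import random
import statistics

# Fixed seed so that repeated runs give the same numbers.
rng = random.Random(42)

# Hypothetical population: 10,000 simulated daily maximum temperatures.
population = [rng.gauss(30, 5) for _ in range(10_000)]

# Simple random sample: every member is equally likely to be chosen.
sample = rng.sample(population, 100)

# Biased sample: only the 100 hottest days are selected.
biased_sample = sorted(population)[-100:]

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
biased_mean = statistics.fmean(biased_sample)

print("population mean:", round(population_mean, 2))
print("random sample mean:", round(sample_mean, 2))
print("biased sample mean:", round(biased_mean, 2))
```

The random sample's mean should land close to the population mean, while the biased sample overstates it considerably.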

Measure of Central Tendency

Measures of central tendency are measures that describe data using a single value.

Mean, median, and mode are the three measures of central tendency.

Mean

Mean is the arithmetic average value of the data. Assume that the data has n observations in a sample, and let Xi be the value of the ith observation. The sample mean is then:

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
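As a minimal sketch, the mean can be computed directly from its definition (sum of the observations divided by n) or with the standard library's `statistics.mean`. The observations are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical sample of n observations.
observations = [4, 8, 15, 16, 23, 42]
n = len(observations)

# Mean from the definition: sum of all X_i divided by n.
mean_by_formula = sum(observations) / n

# The same result using the standard library.
mean_by_library = statistics.mean(observations)

print(mean_by_formula, mean_by_library)
```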

Median (or Mid Value)

Median is the value that divides the data into two equal parts; that is, the proportion of data below and above the median is 50% each.
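The sketch below, using hypothetical values, shows how `statistics.median` handles odd and even numbers of observations, and why the median is less sensitive to an extreme value than the mean.

```python
import statistics

# Hypothetical data with an odd and an even number of observations.
odd_count = [3, 1, 7, 5, 9]
even_count = [3, 1, 7, 5]

# Odd count: the middle value of the sorted data.
median_odd = statistics.median(odd_count)

# Even count: the average of the two middle values of the sorted data.
median_even = statistics.median(even_count)

# One extreme value moves the mean a lot but leaves the median unchanged.
with_outlier = [1, 3, 5, 7, 900]
median_with_outlier = statistics.median(with_outlier)
mean_with_outlier = statistics.mean(with_outlier)

print(median_odd, median_even)
print(median_with_outlier, mean_with_outlier)
```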

Mode

Mode is the most frequently occurring value in a dataset.
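A short sketch with hypothetical data: `statistics.mode` returns the most frequent value and also works for non-numeric (nominal) data, while `statistics.multimode` (Python 3.8+) returns every value tied for the highest frequency.

```python
import statistics

# Hypothetical nominal data: the mode is defined even for labels.
colours = ["red", "blue", "red", "green"]
most_common_colour = statistics.mode(colours)

# Hypothetical data with two values tied for most frequent.
scores = [1, 1, 2, 2, 3]
all_modes = statistics.multimode(scores)

print(most_common_colour)
print(all_modes)
```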

Measure of Variance

One of the primary objectives of data analytics is to understand the variability in data. Variability in data is measured using the following measures:

Range
Inter-Quartile Distance (IQD)
Variance
Standard Deviation

Range

It is the difference between the maximum and minimum values of the data. It measures the spread of the data.
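In code, the range is a one-line calculation. The values below are hypothetical.

```python
# Hypothetical observations.
data = [12, 7, 3, 18, 25, 9]

# Range: maximum value minus minimum value.
data_range = max(data) - min(data)

print(data_range)
```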

Inter-Quartile Distance

Inter-Quartile Distance (IQD), also called Inter-Quartile Range, measures the distance between Quartile 1 (Q1) and Quartile 3 (Q3). IQD is a useful measure for identifying outliers in data.

Values of data below Q1 − 1.5 × IQD and above Q3 + 1.5 × IQD are classified as outliers.
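The outlier rule above can be sketched with `statistics.quantiles`. The data is hypothetical, and note that quartile conventions differ between tools: this sketch relies on the function's default method, so other libraries may report slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 50]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Inter-quartile distance and the two outlier fences.
iqd = q3 - q1
lower_fence = q1 - 1.5 * iqd
upper_fence = q3 + 1.5 * iqd

# Values outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQD:", iqd)
print("outliers:", outliers)
```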

Variance

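Variance measures the variability of the data around the mean value. For a population of n observations with mean μ, the population variance σ² is given by:

$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2$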
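As a sketch on hypothetical data, the population variance can be computed from its definition and checked against `statistics.pvariance`. The standard library also offers `statistics.variance`, which divides by n − 1 instead of n and is the usual choice when the data is a sample rather than the whole population.

```python
import statistics

# Hypothetical observations, treated here as a full population.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mu = sum(data) / n

# Population variance from the definition: mean squared deviation from mu.
variance_by_formula = sum((x - mu) ** 2 for x in data) / n

# Library equivalents: population (divide by n) and sample (divide by n - 1).
population_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)

print(variance_by_formula, population_variance, sample_variance)
```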

Standard Deviation

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
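A brief sketch on hypothetical data confirms the relationship: the standard deviation returned by `statistics.pstdev` equals the square root of the population variance.

```python
import math
import statistics

# Hypothetical observations, treated as a full population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Standard deviation as the square root of the population variance.
std_from_variance = math.sqrt(statistics.pvariance(data))

# The same quantity computed directly by the standard library.
std_by_library = statistics.pstdev(data)

print(std_from_variance, std_by_library)
```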

Chebyshev’s Theorem

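Chebyshev’s theorem (also called Chebyshev’s inequality) gives a lower bound on the proportion of observations that lie within an interval defined using the mean and standard deviation. For any distribution with a finite mean and variance, at least 1 − 1/k² of the observations lie within k standard deviations of the mean, for any k > 1. Unlike the empirical rule, which applies to normal distributions, it holds regardless of the shape of the distribution.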
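The following sketch compares the standard Chebyshev bound, 1 − 1/k² for k > 1, with the proportion actually observed in a small, skewed, hypothetical dataset. The function names are introduced only for this illustration.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def proportion_within(data, k):
    """Observed proportion of data within k population standard deviations."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)

# Hypothetical, clearly non-normal data.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 50]

for k in (1.5, 2, 3):
    bound = chebyshev_lower_bound(k)
    observed = proportion_within(data, k)
    print(f"k={k}: bound={bound:.3f}, observed={observed:.3f}")
```

The observed proportion is never below the bound, although it is often well above it; the bound is conservative because it makes no assumption about the distribution.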

Measure of Shape — Skewness and Kurtosis

Skewness : Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.

Kurtosis : Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Excess kurtosis is a metric that compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula below:

Excess Kurtosis = Kurtosis − 3
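Skewness and kurtosis can be sketched from central moments using only the standard library. This is one common convention (population moments with no small-sample correction), so statistical packages that apply bias corrections may report slightly different values. The helper names and the data are assumptions made for illustration.

```python
import statistics

def central_moment(data, order):
    """Average of (x - mean) ** order over all observations."""
    mu = statistics.fmean(data)
    return sum((x - mu) ** order for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth central moment scaled by the variance squared."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

def excess_kurtosis(data):
    """Kurtosis relative to a normal distribution, whose kurtosis is 3."""
    return kurtosis(data) - 3

# Hypothetical datasets: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]

print("symmetric skewness:", skewness(symmetric))
print("right-skewed skewness:", round(skewness(right_skewed), 3))
print("symmetric excess kurtosis:", round(excess_kurtosis(symmetric), 3))
```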

Types of Kurtosis

1. Mesokurtic

Data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero. It means that if the data follows a normal distribution, it follows a mesokurtic distribution.

2. Leptokurtic

Leptokurtic indicates a distribution with positive excess kurtosis. A leptokurtic distribution has heavy tails on either side, indicating a higher likelihood of large outliers.

3. Platykurtic

A platykurtic distribution shows a negative excess kurtosis. Its tails are lighter (thinner) than those of a normal distribution, which indicates that extreme outliers are less likely.
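The three types can be sketched as a simple classification on excess kurtosis. The `tolerance` that decides what counts as "close to zero" is an arbitrary assumption for this illustration, not a standard threshold, and the datasets are hypothetical.

```python
import statistics

def excess_kurtosis(data):
    """Population excess kurtosis: fourth moment / variance squared, minus 3."""
    mu = statistics.fmean(data)
    m2 = sum((x - mu) ** 2 for x in data) / len(data)
    m4 = sum((x - mu) ** 4 for x in data) / len(data)
    return m4 / m2 ** 2 - 3

def classify_kurtosis(excess, tolerance=0.5):
    """Label a distribution by its excess kurtosis (tolerance is arbitrary)."""
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"

# Hypothetical data: mostly zeros with two extreme values (heavy tails).
heavy_tailed = [0] * 18 + [-10, 10]

# Hypothetical data: evenly spread values (light tails).
evenly_spread = [1, 2, 3, 4, 5]

print(classify_kurtosis(excess_kurtosis(heavy_tailed)))
print(classify_kurtosis(excess_kurtosis(evenly_spread)))
print(classify_kurtosis(0.0))
```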

Summary

This post gave an introduction to descriptive statistics. You learned what descriptive statistics is and why it is important in any data science or analytical solution. You also learned about data types and their measurement scales. Then we discussed measures of central tendency, measures of variance, and measures of shape. Lastly, you learned about leptokurtic, mesokurtic, and platykurtic distributions and about calculating each summary statistic in Python.
