Simplifying the concepts of Basic Statistics (Part II)

Subhajit Mondal

Published May 26, 2021

Hey guys! Howdy!

In the last part, I discussed basic statistical concepts such as outliers and the measures of central tendency: mean, median, and mode.

In this part, I am going to discuss the statistical measures of dispersion and how each of them relates to outliers, in an easy way for beginners.

*You can read Simplifying the concepts of Basic Statistics (Part I) here.*

Measures of Dispersion:

Measures of dispersion provide information about the spread of a variable’s values. They are also known as measures of variability or spread.

# Range:

In statistics, the range is the spread of your data from the lowest to the highest value in the distribution. It is a commonly used measure of variability.

The range is calculated by subtracting the lowest value from the highest value. While a large range means high variability, a small range means low variability in a distribution.

The formula to calculate the range is: R = H − L, where H is the highest value and L is the lowest value.

The range is the easiest measure of variability to calculate. To find the range, follow these steps:

Order all values in your data set from low to high.
Subtract the lowest value from the highest value.

This process is the same regardless of whether your values are positive or negative, or whole numbers or fractions.

Example:

Your data set is the ages of 8 participants.

First, order the values from low to high to identify the lowest value (L) and the highest value (H).

Age (ordered): 19, 21, 26, 29, 31, 33, 36, 37

Then subtract the lowest from the highest value.

R = H − L

R = 37 − 19 = 18

The range of our data set is 18 years.

Impact of Outliers on Range: The range is heavily affected by outliers.

Suppose there is an outlier in the above data set, say a 9th person aged 115 years.

Age (ordered): 19, 21, 26, 29, 31, 33, 36, 37, 115

Then subtract the lowest from the highest value.

R = H − L

R = 115 − 19 = 96

The range of our data set is now 96 years.

See, a single extreme value increased the range by 96 − 18 = 78 years, so the range no longer describes the spread of the typical ages.
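The range example above can be sketched in a few lines of Python. This is only a minimal illustration using the ages from the example; the helper name `value_range` is my own choice, not a standard function.

```python
# Ages of the 8 participants from the example above.
ages = [19, 21, 26, 29, 31, 33, 36, 37]


def value_range(values):
    """Return the highest value minus the lowest value (R = H - L)."""
    return max(values) - min(values)


range_without_outlier = value_range(ages)        # 37 - 19
range_with_outlier = value_range(ages + [115])   # 115 - 19

print(range_without_outlier)  # 18
print(range_with_outlier)     # 96
```

Note that `max` and `min` do not need sorted input, so the ordering step is only there to make the calculation easy to do by hand.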

# Quantiles:

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created. Common quantiles have special names, such as quartiles (four groups), deciles (ten groups), and percentiles (100 groups). The groups created are termed halves, thirds, quarters, etc., though sometimes the terms for the quantile are used for the groups created, rather than for the cut points.
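To see the "one fewer cut point than groups" rule in practice, here is a small sketch using `statistics.quantiles` from the Python standard library (available from Python 3.8). The data, the numbers 1 to 100, is purely hypothetical and chosen only for illustration.

```python
from statistics import quantiles

# Hypothetical data: the whole numbers from 1 to 100.
data = list(range(1, 101))

quartile_cuts = quantiles(data, n=4)      # cut points for 4 groups
decile_cuts = quantiles(data, n=10)       # cut points for 10 groups
percentile_cuts = quantiles(data, n=100)  # cut points for 100 groups

# Each list holds one fewer cut point than the number of groups.
print(len(quartile_cuts))    # 3
print(len(decile_cuts))      # 9
print(len(percentile_cuts))  # 99
```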

# Percentiles:

A percentile is a measure that indicates the value below which a given percentage of observations in a group of observations falls. It represents the position of a value in the data set.

To calculate a percentile, the values in the data set must first be sorted in ascending order.

Example:

Data set (ordered in ascending): 12, 24, 41, 51, 67, 67, 85, 99

Total number of responses: n = 8

Middle numbers: 51, 67

Median: Find the mean of the two middle numbers: (51 + 67)/2 = 59

The median 59 has 4 of the 8 values below it.

In other words, 59 is the 50th percentile of the data set because 50% of the values are less than 59. In general, if k is the nth percentile, then n% of the values are less than k.

Impact of Outliers on Percentile: Percentiles are much less affected by outliers.

Suppose there is an outlier in the above data set, say 321 as the 9th observation.

Data set (ordered in ascending): 12, 24, 41, 51, 67, 67, 85, 99, 321

Total number of responses: n = 9

Middle number: 67

Median: 67

See, here 67 is the 50th percentile, so the outlier shifted the median only slightly. 321 is the 100th percentile, as all the other values are less than 321. The large gap between it and the rest of the values is what lets us detect the outlier and, if appropriate, remove it.
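The median calculations above can be checked with a short Python sketch. It uses `statistics.median` from the standard library; `percent_below` is a hypothetical helper I wrote for this illustration, using the "percentage of values less than k" definition given earlier.

```python
from statistics import median

# Ordered data set from the example above.
data = [12, 24, 41, 51, 67, 67, 85, 99]


def percent_below(values, k):
    """Percentage of values strictly less than k."""
    return 100 * sum(v < k for v in values) / len(values)


median_without_outlier = median(data)        # mean of 51 and 67
median_with_outlier = median(data + [321])   # middle value of 9 values
share_below_median = percent_below(data, median_without_outlier)

print(median_without_outlier)  # 59.0
print(median_with_outlier)     # 67
print(share_below_median)      # 50.0
```

Adding 321 moves the median from 59 to 67, a small shift compared with the size of the outlier itself.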

# Interquartile Range (IQR):

The interquartile range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile.

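To find the interquartile range (IQR), first find the median (middle value) of the lower half and of the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1. When the number of values is odd, conventions differ on whether the overall median belongs to the halves; the example below includes it in both halves.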

Example:

Data set (ordered in ascending): 12, 24, 41, 51, 67, 67, 85, 99, 115

Total number of responses: n = 9

Middle number: 67 (Median).

Q2 = 67: the 50th percentile of the whole data, i.e. the median.

Q1 = 41: the 25th percentile of the data.

Q3 = 85: the 75th percentile of the data.

Interquartile range (IQR) = Q3 − Q1 = 85 − 41 = 44

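Note: Quartiles are defined on data sorted in ascending order, so Q3 is never smaller than Q1 and the IQR is never negative. If you read the same positions off a list sorted in descending order, you would get 41 − 85 = −44: the magnitude is the same and only the sign differs. That sign carries no meaning, which is why we always subtract the smaller quartile from the larger one (Q3 − Q1) on ascending data.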

Impact of Outliers on IQR: The interquartile range is largely unaffected by outliers.

Since the IQR is simply the range of the middle 50% of data values, extreme values at either end of the data set do not change it.
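Here is a minimal sketch of the IQR calculation using `statistics.quantiles` (Python 3.8+). I am assuming `method="inclusive"` because it reproduces the quartiles of the example (Q1 = 41, Q3 = 85); the default `"exclusive"` method follows a different convention and gives different quartiles for the same data. The second data set, where 115 is replaced by 1000, is a hypothetical outlier added only for illustration.

```python
from statistics import quantiles

# Ordered data set from the IQR example above.
data = [12, 24, 41, 51, 67, 67, 85, 99, 115]


def iqr(values):
    """Interquartile range (Q3 - Q1) using the inclusive quartile method."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


iqr_original = iqr(data)

# Hypothetical outlier: replace the largest value (115) with 1000.
iqr_with_outlier = iqr(data[:-1] + [1000])

print(iqr_original)      # 44.0
print(iqr_with_outlier)  # 44.0
```

The extreme value sits outside the middle 50% of the data, so Q1 and Q3 stay where they were and the IQR does not move.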

# Variance:

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread out the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².

Steps for calculating the variance:

Data set: 46, 69, 32, 60, 52, 41

Total number of responses: n= 6

Step 1: Find the mean: add up all the scores, then divide the total by the number of scores.

x̅ = (46 + 69 + 32 + 60 + 52 + 41) / 6 = 300 / 6 = 50

Step 2: Find each score’s deviation from the mean.

Subtract the mean from each score to get the deviations from the mean.

Since x̅ = 50, take away 50 from each score. The deviations are −4, 19, −18, 10, 2, −9.

Step 3: Square each deviation from the mean.

Multiply each deviation from the mean by itself. This will result in positive numbers: 16, 361, 324, 100, 4, 81.

Step 4: Find the sum of squares.

Add up all of the squared deviations. This is called the sum of squares: 16 + 361 + 324 + 100 + 4 + 81 = 886.

Step 5: Divide the sum of squares by n − 1 or N

Divide the sum of squares by n − 1 (for a sample) or N (for a population).

Since we’re working with a sample, we’ll use n − 1, where n = 6.

s² = 886 / (6 − 1) = 886 / 5 = 177.2
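The five steps above translate almost line by line into Python. This is a minimal sketch with the same data set; the variable names are my own, and the last line cross-checks the manual result against `statistics.variance`, which also divides by n − 1.

```python
from statistics import variance

scores = [46, 69, 32, 60, 52, 41]
n = len(scores)

# Step 1: find the mean.
mean_score = sum(scores) / n

# Step 2: find each score's deviation from the mean.
deviations = [x - mean_score for x in scores]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: find the sum of squares.
sum_of_squares = sum(squared_deviations)

# Step 5: divide by n - 1 because we are working with a sample.
sample_variance = sum_of_squares / (n - 1)

print(mean_score)        # 50.0
print(sum_of_squares)    # 886.0
print(sample_variance)   # 177.2
print(variance(scores))  # 177.2
```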

Impact of Outliers on Variance: The sample variance is very sensitive to outliers. Because each deviation is squared, a value far from the mean contributes disproportionately, so even a few outliers in the data set can change the variance drastically.

# Standard Deviation:

The standard deviation is the average amount of variability in your dataset. It tells you, on average, how far each value lies from the mean.

A high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.

There are two types of standard deviation.

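1. Population standard deviation:

When you have collected data from every member of the population that you’re interested in, you can get an exact value for population standard deviation.

The population standard deviation formula looks like this:

σ = √( Σ(X − μ)² / N )

where σ is the population standard deviation, X is each value, μ is the population mean, and N is the number of values in the population.

2. Sample standard deviation:

When you collect data from a sample, the sample standard deviation is used to make estimates or inferences about the population standard deviation.

The sample standard deviation formula looks like this:

s = √( Σ(X − x̅)² / (n − 1) )

where s is the sample standard deviation, X is each value, x̅ is the sample mean, and n is the number of values in the sample.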

Steps for calculating the standard deviation:

Data set: 46, 69, 32, 60, 52, 41

Total number of responses: n= 6

Step 1: Find the mean

Step 2: Find each score’s deviation from the mean

Step 3: Square each deviation from the mean

Step 4: Find the sum of squares

Step 5: Find the variance

Since I am using the same data set as in the variance example above, Steps 1 to 5 are already done and we found the variance = 177.2

Step 6: Find the square root of the variance

To find the standard deviation, we take the square root of the variance.

SD = √177.2 ≈ 13.31

Knowing that SD = 13.31, we can say that each score deviates from the mean by about 13.31 points on average.

Impact of Outliers on Standard Deviation: Like the variance, the standard deviation is sensitive to outliers. A single outlier can raise the standard deviation and, in turn, distort the picture of spread.
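As a quick check, the standard library functions `statistics.stdev` (sample, divides by n − 1) and `statistics.pstdev` (population, divides by N) can be applied to the same data set. The extra score of 150 is a hypothetical outlier that I added only to illustrate the sensitivity described above.

```python
from statistics import pstdev, stdev

scores = [46, 69, 32, 60, 52, 41]

sample_sd = stdev(scores)        # square root of 886 / 5
population_sd = pstdev(scores)   # square root of 886 / 6

# Hypothetical outlier: one extra score of 150.
sample_sd_with_outlier = stdev(scores + [150])

print(round(sample_sd, 2))       # 13.31
print(round(population_sd, 2))   # 12.15
print(sample_sd_with_outlier > sample_sd)  # True
```

The sample standard deviation is slightly larger than the population one because it divides the same sum of squares by a smaller number.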

# Median Absolute Deviation (MAD):

The median absolute deviation (MAD) is a robust measure of how spread out a set of data is. The variance and standard deviation are also measures of spread, but they are more affected by extremely high or extremely low values and by non-normality. If your data is normal, the standard deviation is usually the best choice for assessing spread. However, if your data isn’t normal, the MAD is one statistic you can use instead.

Example:

Data set: 3, 8, 8, 8, 8, 9, 9, 9, 9.

Total number of responses: n= 9

Step 1: Find the median. The median for this set of numbers is 8.

Step 2: Subtract the median from each x-value and take the absolute value of each difference.

|3 − 8| = 5

|8 − 8| = 0

|9 − 8| = 1

Step 3: Find the median of the absolute differences. The median of the differences (0,0,0,0,1,1,1,1,5) is 1, so MAD = 1.

Impact of Outliers on Median Absolute Deviation: The median absolute deviation is one of the most robust measures of dispersion/scale in the presence of outliers. That means outliers have little or no effect on the MAD.
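The three MAD steps can be sketched in Python as follows. The function name `median_absolute_deviation` is my own, and the second data set, where the value 3 is replaced by −1000, is a hypothetical extreme case used only to illustrate the robustness claim.

```python
from statistics import median

data = [3, 8, 8, 8, 8, 9, 9, 9, 9]


def median_absolute_deviation(values):
    """Median of the absolute differences from the median."""
    center = median(values)                               # Step 1
    absolute_differences = [abs(v - center) for v in values]  # Step 2
    return median(absolute_differences)                   # Step 3


mad = median_absolute_deviation(data)

# Hypothetical extreme outlier: replace 3 with -1000.
mad_with_outlier = median_absolute_deviation([-1000] + data[1:])

print(mad)               # 1
print(mad_with_outlier)  # 1
```

Even with a value as extreme as −1000, both the median (8) and the MAD (1) stay the same for this data set.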

Measures of Association:

# Correlation:

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).
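To make the sign of r concrete, here is a minimal sketch of the Pearson correlation coefficient, which is the most common choice for r (I am assuming that is the coefficient meant here). The function `pearson_r` and the three small data sets are hypothetical and exist only for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data, chosen to give perfect correlations.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # grows as x grows
y_falling = [10, 8, 6, 4, 2]   # shrinks as x grows

r_positive = pearson_r(x, y_rising)
r_negative = pearson_r(x, y_falling)

print(r_positive)  # 1.0
print(r_negative)  # -1.0
```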

# Covariance:

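Covariance is a quantitative measure of the joint variability of two variables: it is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.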
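A small sketch of the sample covariance, assuming the usual n − 1 divisor for a sample. The function `sample_covariance` and the data are hypothetical, picked only so the arithmetic is easy to follow.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # moves in the same direction as x
y_falling = [10, 8, 6, 4, 2]   # moves in the opposite direction

cov_positive = sample_covariance(x, y_rising)
cov_negative = sample_covariance(x, y_falling)

print(cov_positive)  # 5.0
print(cov_negative)  # -5.0
```

Unlike r, the covariance is not limited to the range −1 to +1; its size depends on the units of the two variables.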

I think this article has gone a bit long, but it should clear up all the basic ideas about descriptive statistics.

*You can read Simplifying the concepts of Basic Statistics (Part I) here.*

Read and review. Your valuable feedback will encourage me to write more intensive articles.

Happy Learning!
