Simplifying the concepts of Basic Statistics (Part II)

Hey guys! howdy!

In the last part, I discussed basic statistical concepts such as outliers and the measures of central tendency: mean, median, and mode.

In this part, I am going to discuss statistical measures of dispersion and how each of them is affected by outliers, in an easy way for beginners.

*You can read Simplifying the concepts of Basic Statistics (Part I) here.*

Measures of Dispersion:

Measures of dispersion provide information about the spread of a variable’s values. They are also known as measures of variability or spread.

# Range:

In statistics, the range is the spread of your data from the lowest to the highest value in the distribution. It is a commonly used measure of variability.

The range is calculated by subtracting the lowest value from the highest value. While a large range means high variability, a small range means low variability in a distribution.

The formula to calculate the range is:

R = H − L, where H is the highest value and L is the lowest value.


The range is the easiest measure of variability to calculate. To find the range, follow these steps:

  1. Order all values in your data set from low to high.
  2. Subtract the lowest value from the highest value.

This process is the same regardless of whether your values are positive or negative, or whole numbers or fractions.

Example:

Your data set is the ages of 8 participants.

First, order the values from low to high to identify the lowest value (L) and the highest value (H).

Age (ordered): 19, 21, 26, 29, 31, 33, 36, 37, so L = 19 and H = 37.

Then subtract the lowest from the highest value.

R = H − L

R = 37 − 19 = 18

The range of our data set is 18 years.

Impact of Outliers on Range: The range is strongly affected by outliers.

Suppose there is an outlier in the data set above, say 115 years as the 9th person’s age.

Age (ordered): 19, 21, 26, 29, 31, 33, 36, 37, 115

Then subtract the lowest value from the highest value.

R = H − L

R = 115 − 19 = 96

The range of our data set is now 96 years.

A single extra value increased the range by 96 − 18 = 78 years, so the range no longer describes how the typical ages are spread.
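The comparison above can be reproduced in a few lines of Python. This is only a minimal sketch: the helper name `value_range` is my own choice for illustration, and the ages are the ones from the example.

```python
def value_range(values):
    """Return the range: highest value minus lowest value."""
    ordered = sorted(values)          # order from low to high
    return ordered[-1] - ordered[0]   # H - L


ages = [19, 21, 26, 29, 31, 33, 36, 37]   # the 8 ages from the example
ages_with_outlier = ages + [115]          # add the 115-year-old 9th person

range_without = value_range(ages)
range_with = value_range(ages_with_outlier)

print(range_without)               # 18
print(range_with)                  # 96
print(range_with - range_without)  # 78
```

Because the range only looks at the two extreme values, one unusual observation is enough to change it completely.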

# Quantiles:

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created. Common quantiles have special names, such as quartiles (four groups), deciles (ten groups), and percentiles (100 groups). The groups created are termed halves, thirds, quarters, etc., though sometimes the terms for the quantile are used for the groups created, rather than for the cut points.

# Percentiles:

A percentile is a measure that indicates the value below which a given percentage of the observations in a group falls. It represents the position of a value in the data set.

To calculate a percentile, the values in the data set must first be sorted in ascending order.

Example:

Data set (ordered in ascending order): 12, 24, 41, 51, 67, 67, 85, 99

Total number of responses: n = 8

Middle numbers: 51, 67

Median: Find the mean of the two middle numbers: (51 + 67) / 2 = 59

The median 59 has 4 values less than itself out of 8.

In other words, 59 is the 50th percentile of this data set, because 50% of the terms are less than 59. In general, if k is the pth percentile, then p% of the terms are less than k.

Impact of Outliers on Percentile: Percentiles are much less affected by outliers.

Suppose there is an outlier in the data set above, say 321 as the 9th observation.

Data set (ordered in ascending order): 12, 24, 41, 51, 67, 67, 85, 99, 321

Total number of responses: n = 9

Middle number: 67

Median: 67

Here 67 is the 50th percentile, and 321 sits at the very top of the data set because every other term is less than it. The median moved only from 59 to 67, while the highest value jumped from 99 to 321. A gap that large between the top value and the rest of the data is a sign of an outlier, which we can then investigate and, if appropriate, remove.
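As a rough check of the percentile example, here is a small Python sketch. The helper `percent_below` is a name I made up for illustration, and it uses the "strictly less than" definition from this article; statistical libraries often use slightly different percentile conventions, so their results can differ a little.

```python
from statistics import median


def percent_below(values, k):
    """Percentage of values that are strictly less than k."""
    below = sum(1 for v in values if v < k)
    return 100 * below / len(values)


data = [12, 24, 41, 51, 67, 67, 85, 99]
data_with_outlier = data + [321]          # 321 added as the 9th observation

median_without = median(data)               # mean of the two middle numbers
median_with = median(data_with_outlier)     # single middle number of 9 terms

print(median_without, percent_below(data, median_without))
print(median_with)
print(max(data), max(data_with_outlier))    # how far the top value moved
```

The median shifts only slightly when the outlier is added, while the maximum changes a lot.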

# Interquartile Range (IQR):

The interquartile range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile.

To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.

Example:

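Data set (ordered in ascending order): 12, 24, 41, 51, 67, 67, 85, 99, 115

Total number of responses: n = 9

Middle number: 67 (median).

Q2 = 67: the 50th percentile of the whole data set, which is the median.

Q1 = 41: the 25th percentile of the data (the median of the lower half 12, 24, 41, 51, 67, counting the overall median as part of each half).

Q3 = 85: the 75th percentile of the data (the median of the upper half 67, 67, 85, 99, 115).

Interquartile range (IQR) = Q3 − Q1 = 85 − 41 = 44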

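Note: Quartiles are defined on data sorted in ascending order, so Q3 is never smaller than Q1 and the IQR is never negative. If you read the quartile positions off a list sorted in descending order, you would get 41 − 85 = −44. The magnitude is the same and only the sign differs, because the labels Q1 and Q3 have been swapped. That is why we always subtract the smaller quartile from the larger one (Q3 − Q1).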

Impact of Outliers on IQR: The interquartile range is largely unaffected by outliers.

Since the IQR is simply the range of the middle 50% of the data values, extreme values in the lowest or highest 25% do not change it.
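The IQR example can be checked with Python’s standard `statistics` module. This is a sketch under one assumption worth stating loudly: `method="inclusive"` is chosen here because it reproduces the Q1 = 41 and Q3 = 85 used in this article; the module’s default `"exclusive"` method follows a different convention and gives different quartiles for the same data. The value 1150 is a made-up outlier for illustration.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range Q3 - Q1, using the 'inclusive' quartile method."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [12, 24, 41, 51, 67, 67, 85, 99, 115]
data_extreme = [12, 24, 41, 51, 67, 67, 85, 99, 1150]  # hypothetical outlier

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)          # the three quartiles
print(iqr(data))           # IQR of the original data
print(iqr(data_extreme))   # IQR with the extreme last value

# For contrast, the plain range reacts strongly to the same change.
print(max(data) - min(data), max(data_extreme) - min(data_extreme))
```

The IQR stays the same for both data sets, while the range grows by an order of magnitude.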

# Variance:

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread out the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s² for a sample and σ² for a population.

Sample variance: s² = Σ(xᵢ − x̅)² / (n − 1)

Population variance: σ² = Σ(xᵢ − μ)² / N

Steps for calculating the variance:

Data set: 46, 69, 32, 60, 52, 41

Total number of responses: n= 6

Step 1: Find the mean: add up all the scores, then divide them by the number of scores.

x̅ = (46 + 69 + 32 + 60 + 52 + 41) / 6 = 300 / 6 = 50

Step 2: Find each score’s deviation from the mean.

Subtract the mean from each score to get the deviations from the mean.

Since x̅ = 50, take away 50 from each score.

Deviations from the mean: −4, 19, −18, 10, 2, −9

Step 3: Square each deviation from the mean.

Multiply each deviation from the mean by itself. This will result in positive numbers.

Squared deviations: 16, 361, 324, 100, 4, 81

Step 4: Find the sum of squares.

Add up all of the squared deviations. This is called the sum of squares.

Sum of squares = 16 + 361 + 324 + 100 + 4 + 81 = 886

Step 5: Divide the sum of squares by n − 1 or N

Divide the sum of squares by n − 1 (for a sample) or N (for a population).

Since we’re working with a sample, we’ll use n − 1, where n = 6.

s² = 886 / (6 − 1) = 886 / 5 = 177.2
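The five steps can be followed one by one in Python. This is a minimal sketch with variable names of my own choosing; the cross-check at the end uses `variance` and `pvariance` from the standard `statistics` module.

```python
from statistics import pvariance, variance

scores = [46, 69, 32, 60, 52, 41]
n = len(scores)

mean = sum(scores) / n                        # Step 1: find the mean
deviations = [x - mean for x in scores]       # Step 2: deviation of each score
squared = [d ** 2 for d in deviations]        # Step 3: square each deviation
sum_of_squares = sum(squared)                 # Step 4: sum of squares
sample_variance = sum_of_squares / (n - 1)    # Step 5: divide by n - 1 (sample)
population_variance = sum_of_squares / n      # Step 5: divide by N (population)

print(mean, sum_of_squares)
print(sample_variance, population_variance)

# Cross-check against the standard library implementations.
print(variance(scores), pvariance(scores))
```

Dividing by n − 1 instead of N always gives the slightly larger value, which matters most for small samples like this one.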

Impact of Outliers on Variance: The sample variance is highly sensitive to outliers. Because every deviation is squared, a value far from the mean contributes disproportionately, so even a few outliers in the data set can change the variance drastically.

# Standard Deviation:

The standard deviation is the average amount of variability in your dataset. It tells you, on average, how far each value lies from the mean.

A high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.

There are two types of standard deviation.

1. Population standard deviation:

When you have collected data from every member of the population that you’re interested in, you can get an exact value for population standard deviation.

The population standard deviation formula looks like this:

σ = √( Σ(xᵢ − μ)² / N )

2. Sample standard deviation:

When you collect data from a sample, the sample standard deviation is used to make estimates or inferences about the population standard deviation.

The sample standard deviation formula looks like this:

s = √( Σ(xᵢ − x̅)² / (n − 1) )

Steps for calculating the standard deviation:

Data set: 46, 69, 32, 60, 52, 41

Total number of responses: n= 6

Step 1: Find the mean

Step 2: Find each score’s deviation from the mean

Step 3: Square each deviation from the mean

Step 4: Find the sum of squares

Step 5: Find the variance

Since I am using the same data set as in the variance example above, Steps 1 to 5 are already done, and we found the variance = 177.2.

Step 6: Find the square root of the variance

To find the standard deviation, we take the square root of the variance.

s = √177.2 ≈ 13.31

Knowing that SD ≈ 13.31, we can say that each score deviates from the mean by roughly 13.31 points on average.
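Step 6 can be sketched in Python as follows. The variable names are my own, and `stdev` and `pstdev` from the standard `statistics` module are used only as a cross-check of the manual calculation.

```python
from math import sqrt
from statistics import pstdev, stdev

scores = [46, 69, 32, 60, 52, 41]
n = len(scores)
mean = sum(scores) / n
sum_of_squares = sum((x - mean) ** 2 for x in scores)

sample_sd = sqrt(sum_of_squares / (n - 1))   # sample SD: divide by n - 1
population_sd = sqrt(sum_of_squares / n)     # population SD: divide by N

print(round(sample_sd, 2))       # 13.31
print(round(population_sd, 2))

# The standard library gives the same two values.
print(stdev(scores), pstdev(scores))
```

Unlike the variance, the standard deviation is expressed in the same units as the original scores, which makes it easier to interpret.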

Impact of Outliers on Standard Deviation: Standard Deviation is also sensitive to outliers like variance. A single outlier can raise the standard deviation and in turn, distort the picture of spread.

# Median Absolute Deviation(MAD):

The median absolute deviation (MAD) is a robust measure of how spread out a set of data is. The variance and standard deviation are also measures of spread, but they are more affected by extremely high or extremely low values and by non-normality. If your data is normally distributed, the standard deviation is usually the best choice for assessing spread. However, if your data isn’t normally distributed, the MAD is one statistic you can use instead.

MAD = median( |xᵢ − median(x)| )

Example:

Data set: 3, 8, 8, 8, 8, 9, 9, 9, 9.

Total number of responses: n= 9

Step 1: Find the median. The median for this set of numbers is 8.

Step 2: Subtract the median from each x-value and take the absolute value of each difference.

|3–8| = 5

|8–8| = 0

|8–8| = 0

|8–8| = 0

|8–8| = 0

|9–8| = 1

|9–8| = 1

|9–8| = 1

|9–8| = 1

Step 3: Find the median of the absolute differences. The median of the differences (0, 0, 0, 0, 1, 1, 1, 1, 5) is 1, so the MAD is 1.

Impact of Outliers on Median Absolute Deviation: The median absolute deviation is one of the most robust measures of dispersion in the presence of outliers. This means outliers have little or no effect on the MAD.
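The three steps of the MAD example translate directly into Python. In this sketch the function name `median_absolute_deviation` is my own, and the second data set, where 3 is replaced by −300, is a made-up example to show the robustness.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute differences from the median."""
    center = median(values)                         # Step 1: the median
    abs_diffs = [abs(x - center) for x in values]   # Step 2: |x - median|
    return median(abs_diffs)                        # Step 3: median of those


data = [3, 8, 8, 8, 8, 9, 9, 9, 9]
data_extreme = [-300, 8, 8, 8, 8, 9, 9, 9, 9]   # hypothetical extreme value

mad = median_absolute_deviation(data)
mad_extreme = median_absolute_deviation(data_extreme)

print(mad)           # 1
print(mad_extreme)   # 1
```

Even with a value as extreme as −300, the MAD does not move, because both medians ignore how far away the extreme value is.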

Measures of Association:

# Correlation:

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger, the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).

# Covariance:

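Covariance is a quantitative measure of the joint variability of two variables. A positive covariance means the variables tend to move in the same direction, and a negative covariance means they tend to move in opposite directions.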
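To make both measures of association concrete, here is a small Python sketch. It assumes the usual sample formulas, cov(x, y) = Σ(xᵢ − x̅)(yᵢ − ȳ) / (n − 1) and r = cov(x, y) / (sₓ · sᵧ). The function names and the `hours` and `marks` numbers are made up purely for illustration.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x)(y - mean_y), divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    sd_x = sqrt(sample_covariance(xs, xs))   # covariance of x with itself = variance
    sd_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)


hours = [1, 2, 3, 4, 5]       # hypothetical study hours
marks = [2, 4, 6, 8, 10]      # hypothetical marks, rising with hours

print(sample_covariance(hours, marks), pearson_r(hours, marks))

# Reversing one variable turns the relationship into an inverse one.
print(sample_covariance(hours, marks[::-1]), pearson_r(hours, marks[::-1]))
```

Covariance depends on the units of the two variables, while the correlation coefficient rescales it to the range −1 to +1, which is why r is easier to compare across data sets.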

I think this article ran a bit long, but it should clarify the basic ideas of descriptive statistics.

*You can read Simplifying the concepts of Basic Statistics (Part I) here.*

Read and review. Your valuable feedback will encourage me to write more intensive articles.

Happy Learning!

More articles by Subhajit Mondal
