Determining the central tendency in case of a skewed distribution
In statistics, a skewed distribution is one that is asymmetric: the values in a dataset are not evenly distributed around the mean.
A skewed distribution is characterised by a longer tail on one side of the distribution compared to the other. The direction of the skewness is determined by the tail's position. If the tail extends towards the right (positive values), it is called a right-skewed distribution or positively skewed. On the other hand, if the tail extends towards the left (negative values), it is called a left-skewed distribution or negatively skewed.
In a right-skewed distribution, the mean is typically greater than the median, and the majority of the data is concentrated on the left side. Examples of right-skewed distributions include income distribution (where a few high-income individuals pull the mean upward), exam scores (where a few high-scoring students increase the mean), and stock returns (where large positive returns can skew the distribution).
In a left-skewed distribution, the mean is usually less than the median, and the majority of the data is concentrated on the right side. Examples of left-skewed distributions include the distribution of prices for certain goods (where there may be a lower limit on prices) or the distribution of response times in a task (where there may be a minimum time required to complete the task).
The average of a distribution, termed central tendency in statistics, is usually measured with either the mean or the median. In a skewed distribution, the mean and median can differ significantly from each other because they are influenced differently by the tail of the distribution.
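As a quick sketch of this difference, the snippet below generates a hypothetical right-skewed (log-normal) dataset and compares the two measures; the long right tail pulls the mean above the median:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed sample: log-normal data
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

print(np.mean(data))    # pulled upward by the long right tail
print(np.median(data))  # closer to the bulk of the data
```

For a log-normal distribution with these parameters, the mean sits near 1.65 while the median sits near 1, illustrating how a single number can misrepresent where "typical" values lie.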
Therefore, when dealing with a skewed distribution, the mean can be misleading as a measure of central tendency because it is heavily influenced by extreme values. The median is often considered a more robust measure in such cases.
In summary, while the median is generally more robust in skewed distributions, there are cases, particularly in highly skewed distributions with extreme tail asymmetry or influential outliers, where even the median may not accurately reflect the central tendency.
To overcome this, the Central Limit Theorem (CLT) can be used.
The Central Limit Theorem (CLT) is not directly used for finding central tendency; instead, it is used to make inferences about the population mean when sampling from a population. However, the CLT indirectly relates to central tendency through the concept of the sampling distribution of the mean.
The CLT states that when independent random variables are added, their sum tends to follow a normal distribution as the sample size increases, regardless of the shape of their individual distributions. More specifically, the sampling distribution of the mean (the distribution of sample means taken from multiple samples) approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
This has implications for estimating the population mean, which is a measure of central tendency. By taking multiple random samples from a population and calculating the mean of each sample, the CLT suggests that the distribution of these sample means will be approximately normally distributed, even if the population distribution is not. The mean of the sample means will be close to the population mean, and the standard deviation of the sample means (known as the standard error) will decrease as the sample size increases.
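A small sketch can make the shrinking standard error concrete. Assuming a hypothetical exponential (skewed) population, the empirical spread of the sample means falls as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed "population": exponential data
population = rng.exponential(scale=2.0, size=1_000_000)

def std_error_of_mean(sample_size, n_samples=2_000):
    # Empirical standard deviation of the sample mean (the standard error)
    means = [np.mean(rng.choice(population, size=sample_size))
             for _ in range(n_samples)]
    return np.std(means)

# The standard error shrinks roughly as 1 / sqrt(sample_size)
print(std_error_of_mean(10))
print(std_error_of_mean(1000))
```

With a population standard deviation of about 2, the standard error should be near 2/sqrt(n), so increasing the sample size from 10 to 1000 shrinks it by roughly a factor of ten.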
Therefore, the CLT allows us to use the sample mean as an estimator of the population mean. The advantage of using the sample mean as an estimator of central tendency is that it is an unbiased estimator, meaning that, on average, it provides an accurate estimate of the population mean. Additionally, it has desirable statistical properties, such as efficiency and consistency.
The CLT can be applied easily in Python.
Consider the data distribution below, which is highly skewed.
The mean of the above distribution is 0 and the median is close to 0.7. Using the CLT, we can draw samples from this distribution and compute each sample's mean. When the means of all the samples are plotted, they follow a normal distribution, and their average (i.e. the mean of the sample means) gives a realistic estimate of the distribution's central tendency.
import numpy as np

# listElements holds the skewed dataset shown above
totalSamples = 100000
size = len(listElements) // 2

sample_means = np.empty(totalSamples)
for iterator in range(totalSamples):
    # Draw a random sample (with replacement) and record its mean
    bs_sample = np.random.choice(listElements, size=size)
    sample_means[iterator] = np.mean(bs_sample)
When the means of all the samples are plotted, they look like the distribution below.
We can clearly see that the mean of the distribution of sample means lies at 1.04; therefore, 1.04 is the central tendency value. The median could also be used, but here the CLT provides a more realistic value for the central tendency.
In addition, this resampling approach can also be applied to summary statistics other than the mean. Instead of calculating the means of all samples, we can calculate the medians or other quantile values (such as the 95th or 99.5th percentile). When these medians or quantile values are plotted, they can exhibit a distribution that approximates a normal distribution. By finding the mean, median, confidence interval, or other summary statistics of these resampled values, we can estimate the central tendency of the distribution and infer information about the population. This technique is called bootstrapping.
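A minimal bootstrapping sketch, assuming a hypothetical log-normal dataset, resamples the data with replacement, records each resample's median, and reads off a point estimate and a 95% confidence interval:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(size=5_000)   # hypothetical skewed dataset

n_boot = 2_000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement and record the median of each resample
    resample = rng.choice(data, size=len(data), replace=True)
    boot_medians[i] = np.median(resample)

# Point estimate and 95% confidence interval for the population median
print(np.mean(boot_medians))
print(np.percentile(boot_medians, [2.5, 97.5]))
```

The same loop works for any summary statistic: replace `np.median` with a percentile or any other function of the resample.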