Determining the central tendency in case of a skewed distribution
In statistics, a skewed distribution is one that is asymmetric: the values in a dataset are not evenly distributed around the mean.
A skewed distribution is characterised by a longer tail on one side of the distribution compared to the other. The direction of the skewness is determined by the tail's position. If the tail extends towards the right (positive values), it is called a right-skewed distribution or positively skewed. On the other hand, if the tail extends towards the left (negative values), it is called a left-skewed distribution or negatively skewed.
In a right-skewed distribution, the mean is typically greater than the median, and the majority of the data is concentrated on the left side. Examples of right-skewed distributions include income distribution (where a few high-income individuals pull the mean upward), exam scores (where a few high-scoring students increase the mean), and stock returns (where large positive returns can skew the distribution).
In a left-skewed distribution, the mean is usually less than the median, and the majority of the data is concentrated on the right side. Examples of left-skewed distributions include the distribution of prices for certain goods (where there may be a lower limit on prices) or the distribution of response times in a task (where there may be a minimum time required to complete the task).
The average of a distribution, termed central tendency in statistics, is usually measured with either the mean or the median. In a skewed distribution, the mean and median can differ significantly from each other because they are influenced differently by the tail of the distribution.
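As a quick sketch of this difference, the snippet below generates a hypothetical right-skewed (log-normal) dataset and compares the two measures; the long right tail pulls the mean above the median:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed sample: log-normal data
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

print(np.mean(data))    # pulled upward by the long right tail
print(np.median(data))  # closer to the bulk of the data
```

For a log-normal distribution with these parameters, the mean sits near 1.65 while the median sits near 1, illustrating how a single number can misrepresent where "typical" values lie.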
Therefore, when dealing with a skewed distribution, the mean can be misleading as a measure of central tendency because it is heavily influenced by extreme values. The median is often considered a more robust measure in such cases.
In summary, while the median is generally more robust in skewed distributions, there are cases, particularly in highly skewed distributions with extreme tail asymmetry or influential outliers, where even the median may not accurately reflect the central tendency.
To overcome this, the Central Limit Theorem (CLT) can be used.
The Central Limit Theorem (CLT) is not directly used for finding central tendency; instead, it is used to make inferences about the population mean when sampling from a population. However, the CLT indirectly relates to central tendency through the concept of the sampling distribution of the mean.
The CLT states that when independent random variables are added, their sum tends to follow a normal distribution as the sample size increases, regardless of the shape of their individual distributions. More specifically, the sampling distribution of the mean (the distribution of sample means taken from multiple samples) approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
This has implications for estimating the population mean, which is a measure of central tendency. By taking multiple random samples from a population and calculating the mean of each sample, the CLT suggests that the distribution of these sample means will be approximately normally distributed, even if the population distribution is not. The mean of the sample means will be close to the population mean, and the standard deviation of the sample means (known as the standard error) will decrease as the sample size increases.
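A small sketch can make the shrinking standard error concrete. Assuming a hypothetical exponential (skewed) population, the empirical spread of the sample means falls as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed "population": exponential data
population = rng.exponential(scale=2.0, size=1_000_000)

def std_error_of_mean(sample_size, n_samples=2_000):
    # Empirical standard deviation of the sample mean (the standard error)
    means = [np.mean(rng.choice(population, size=sample_size))
             for _ in range(n_samples)]
    return np.std(means)

# The standard error shrinks roughly as 1 / sqrt(sample_size)
print(std_error_of_mean(10))
print(std_error_of_mean(1000))
```

With a population standard deviation of about 2, the standard error should be near 2/sqrt(n), so increasing the sample size from 10 to 1000 shrinks it by roughly a factor of ten.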
Therefore, the CLT allows us to use the sample mean as an estimator of the population mean. The advantage of using the sample mean as an estimator of central tendency is that it is an unbiased estimator, meaning that, on average, it provides an accurate estimate of the population mean. Additionally, it has desirable statistical properties, such as efficiency and consistency.
The CLT can be applied easily in Python.
Consider the data distribution below, which is highly skewed.
The mean of the above distribution is 0 and the median is close to 0.7. Using the CLT, we can draw samples from this distribution and compute each sample's mean. When the means of all the samples are plotted, they follow a normal distribution, and their average (i.e. the mean of the sample means) gives a realistic estimate of the distribution's central tendency.
import numpy as np

# listElements holds the skewed dataset shown above
totalSamples = 100000
size = len(listElements) // 2

sample_means = np.empty(totalSamples)
for iterator in range(totalSamples):
    # Draw a random sample (with replacement) and record its mean
    bs_sample = np.random.choice(listElements, size=size)
    sample_means[iterator] = np.mean(bs_sample)
When the means of all the samples are plotted, they look like the distribution below.
We can clearly see that the mean of the distribution of sample means lies at 1.04; therefore, 1.04 is the central tendency value. The median could also be used, but here the CLT provides a more realistic value for the central tendency.
In addition, this resampling approach can also be applied to summary statistics other than the mean. Instead of calculating the means of all samples, we can calculate the medians or other quantile values (such as the 95th or 99.5th percentile). When these medians or quantile values are plotted, they can exhibit a distribution that approximates a normal distribution. By finding the mean, median, confidence interval, or other summary statistics of these resampled values, we can estimate the central tendency of the distribution and infer information about the population. This technique is called bootstrapping.
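A minimal bootstrapping sketch, assuming a hypothetical log-normal dataset, resamples the data with replacement, records each resample's median, and reads off a point estimate and a 95% confidence interval:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(size=5_000)   # hypothetical skewed dataset

n_boot = 2_000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement and record the median of each resample
    resample = rng.choice(data, size=len(data), replace=True)
    boot_medians[i] = np.median(resample)

# Point estimate and 95% confidence interval for the population median
print(np.mean(boot_medians))
print(np.percentile(boot_medians, [2.5, 97.5]))
```

The same loop works for any summary statistic: replace `np.median` with a percentile or any other function of the resample.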