Estimates of Location - Basic Statistics for Data Science
Estimates of Location
Data sets contain a great many variables, and a variable with measured or count data might have millions of values. That is why a basic step in exploring data is to locate where most of the values lie and summarize each variable with a "central value".
At first glance this seems simple: just compute the mean. But the mean is not always the best measure of a central value, which is why statisticians have developed several alternatives to it.
Mean
The mean is the most basic estimate of the central value of a variable. In essence, it is the sum of all the values divided by the number of values. For the set of values {1,5,6,8,3}, the mean is (1+5+6+8+3)/5 = 23/5 = 4.6.
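As a quick check of the arithmetic above, the same calculation can be sketched in Python using only the standard library. The variable names are illustrative.

```python
from statistics import mean

values = [1, 5, 6, 8, 3]

# Sum of the values divided by the number of values
manual_mean = sum(values) / len(values)

# statistics.mean computes the same quantity
library_mean = mean(values)

print(manual_mean, library_mean)
```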
Trimmed Mean
A trimmed mean is calculated by sorting the values, dropping a fixed number of them at each end, and taking the average of the remaining values.
Taking the previous example and ordering the values gives {1,3,5,6,8}. To trim by a total of 40%, we remove the lowest 20% and the highest 20% of the values, which eliminates 1 and 8, so the trimmed mean is (3+5+6)/3 ≈ 4.67.
A trimmed mean eliminates the influence of extreme values, which can otherwise distort the result. For example, in international diving, the highest and lowest scores from 5 judges are discarded, and the final score is the average of the scores from the 3 remaining judges. This makes it difficult for a single judge to manipulate the score, perhaps to favor their own country's contestant.
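The trimming procedure can be sketched in a few lines of plain Python. The helper trimmed_mean below is a hypothetical name written for illustration, and the judge scores are made-up numbers. It assumes the proportion is cut from each end and rounded down to a whole number of values, which is also the convention of scipy.stats.trim_mean.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end.

    Illustrative helper: assumes 0 <= proportion < 0.5 so that
    at least one value remains after trimming.
    """
    ordered = sorted(values)
    k = int(proportion * len(ordered))      # values to cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# The example from the text: 20% off each end drops 1 and 8
values = [1, 5, 6, 8, 3]
result = trimmed_mean(values, 0.2)
print(round(result, 2))

# Hypothetical diving scores from 5 judges: drop the lowest and highest
judge_scores = [7.5, 8.0, 8.5, 8.0, 9.5]
final_score = trimmed_mean(judge_scores, 0.2)
print(round(final_score, 2))
```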
Weighted Mean
The weighted mean is calculated by multiplying each value by its weight, summing these products, and dividing the result by the sum of the weights.
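A minimal sketch of that definition, using made-up rates and populations rather than real data. The function name weighted_mean is hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical rates for three regions and their populations (the weights)
rates = [2.0, 4.0, 9.0]
populations = [10, 30, 60]

plain = sum(rates) / len(rates)             # every region counts equally
weighted = weighted_mean(rates, populations)

print(plain, weighted)
```

Because the most populous region also has the highest rate, the weighted mean comes out higher than the plain mean in this example.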
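Two main reasons for using a weighted mean: some values are intrinsically more variable than others, so highly variable observations are given a lower weight; and the data collected may not represent equally the different groups we want to measure, so under-represented groups are given a higher weight.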
Median and Robust Estimates
The median is the middle value of a set of ordered values. This estimate of location might look like a disadvantage compared to the mean, since the mean uses every value and is therefore more sensitive to the data. But suppose that we want to compare household incomes in neighborhoods around Lake Washington with household incomes in the Medina neighborhood. Using the mean as an estimate, we would find a huge difference between them because Bill Gates lives in Medina, and his income alone pulls the mean up. The median, however, does not care how rich Bill Gates is: the position of the middle observation remains the same.
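The effect can be illustrated with a small sketch. The incomes below are made-up numbers, not real data for any neighborhood. Replacing the largest income with a very large one changes the mean enormously while the median does not move.

```python
from statistics import mean, median

# Hypothetical household incomes in one neighborhood
incomes = [52_000, 58_000, 61_000, 66_000, 75_000]

# Same neighborhood, but the richest household is now a billionaire
with_billionaire = incomes[:-1] + [1_000_000_000]

print(mean(incomes), median(incomes))
print(mean(with_billionaire), median(with_billionaire))
```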
Outliers
Going back to the previous example, Bill Gates's income would be an outlier in our data set, since outliers are values that are very distant from the rest of the values in the data. The median remains a valid estimate here, as does the trimmed mean. Outliers are often the result of data errors, such as mixing data in different units (for instance, kilometers with meters), and are usually worth further investigation.
Example of Estimates of Location with Python
#Importing Libraries:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
#Reading Data
>>> state = pd.read_csv('/content/state.csv')
>>> print(state.head(8))
#output
         State  Population  Murder.Rate  Abbreviation
0      Alabama     4779736          5.7            AL
1       Alaska      710231          5.6            AK
2      Arizona     6392017          4.7            AZ
3     Arkansas     2915918          5.6            AR
4   California    37253956          4.4            CA
5     Colorado     5029196          2.8            CO
6  Connecticut     3574097          2.4            CT
7     Delaware      897934          5.8            DE
#Compute the mean, trimmed mean, and median for Population. For mean and median we can use the pandas methods of the data frame. The trimmed mean requires the trim_mean function in scipy.stats.
Mean
>>> print(state['Population'].mean())
6162876.3
Median
>>> print(state['Population'].median())
4436369.5
Trimmed Mean
>>> print(trim_mean(state['Population'], 0.1))
4783697.125
Weighted Mean
The weighted mean is available in NumPy through np.average and its weights argument.
>>> print(state['Murder.Rate'].mean())
4.066
>>> print(np.average(state['Murder.Rate'], weights=state['Population']))
4.445833981123393
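The mean is bigger than the trimmed mean, which is bigger than the median. The trimmed mean is lower than the mean because it excludes the five largest and five smallest states (a proportion of 0.1 drops 10% from each end), so the few very populous states no longer pull the result up. If we want to compute the average murder rate for the country, we need to use a weighted mean or weighted median to account for the different populations of the states: the unweighted mean of 4.066 counts every state equally, while weighting by population gives about 4.446.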
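The weighted median mentioned above is not computed in the example. A minimal sketch follows, assuming one common definition: sort the values, accumulate their weights, and return the first value at which the cumulative weight reaches half of the total. The function name weighted_median and the data are hypothetical, and other conventions exist for handling ties at exactly half the weight.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total.

    Illustrative helper under the definition stated above; it does not
    average the two middle values the way an ordinary median does.
    """
    pairs = sorted(zip(values, weights))    # sort by value
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value

# Hypothetical rates for three regions and their populations (the weights)
rates = [2.0, 4.0, 9.0]
populations = [10, 30, 60]

print(weighted_median(rates, populations))  # dominated by the largest region
print(weighted_median(rates, [1, 1, 1]))    # equal weights: the middle value
```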
Reference: This is a summary of "Practical Statistics for Data Scientists" by Peter Bruce, Andrew Bruce & Peter Gedeck, 2nd Edition. Chapter 1, Estimates of Location.