Estimates of Location - Basic Statistic for Data Science

Estimates of Location

Data sets can contain an immense number of variables, and variables with measured or count data may have millions of values. That is why we need a way to locate most of our data and explore it through a "central value".

To achieve this and summarize our data, we could simply estimate the mean. But the mean is not always the best measure of a central value, which is why statisticians have developed several alternatives to it.

Mean

The mean is the most basic estimate of the central value of a variable in a data set; in essence, it is the sum of all values divided by the number of values. Considering the following set of values: {1, 5, 6, 8, 3}, the mean is (1 + 5 + 6 + 8 + 3) / 5 = 4.6.
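The calculation above can be checked in a couple of lines of Python; the `statistics` module from the standard library gives the same result as summing by hand:

```python
# The mean of {1, 5, 6, 8, 3}, computed by hand and with the statistics module.
import statistics

values = [1, 5, 6, 8, 3]
mean = sum(values) / len(values)
print(mean)                     # 4.6
print(statistics.mean(values))  # 4.6
```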

Trimmed Mean

With a "trimmed mean", we calculate the mean by dropping a fixed number of sorted values at each end of the data set and taking the average of the remaining values.

Considering the previous example, but this time sorting our set of values, we get {1, 3, 5, 6, 8}. To trim a total of 40%, we remove the lowest 20% and the highest 20% of values, eliminating 1 and 8, so the trimmed mean is (3 + 5 + 6) / 3 ≈ 4.67.
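A minimal sketch of that procedure (sort, drop a proportion from each end, average the rest) could look like this; `trimmed_mean` is a hypothetical helper, not a library function:

```python
def trimmed_mean(values, proportion):
    """Drop `proportion` of the values from each end of the sorted data, then average."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)        # number of values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

print(trimmed_mean([1, 5, 6, 8, 3], 0.2))    # (3 + 5 + 6) / 3 ≈ 4.67
```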

A trimmed mean allows us to eliminate extreme values that can greatly affect the estimate. For example, in international diving, the highest and lowest scores from five judges are discarded, and only the average score of the three remaining judges is considered. This makes it difficult for a single judge to manipulate the score, perhaps to favor their country's contestant.

Weighted Mean

The weighted mean is calculated by multiplying each value by a weight, summing those products, and dividing the result by the sum of the weights.

Two main reasons for using weighted mean:

  1. Some values are more variable than others, and highly variable observations are given a lower weight.
  2. The data collected does not equally represent the different groups we are interested in measuring. To correct this, we can give higher weights to the underrepresented groups.

Median and Robust Estimates

The median is the middle value of a sorted set of values. This estimate of location might seem to be at a disadvantage compared with the mean, since the mean considers the whole set of values and is more sensitive to the data. But suppose we want to compare household incomes in neighborhoods around Lake Washington with household incomes in Medina. Using the mean as an estimate, we would find a huge difference between the two because Bill Gates lives in Medina, and for obvious reasons his income would distort the result. If we use the median instead, it does not matter how rich Bill Gates is: the position of the middle observation remains the same.
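A small sketch makes the point concrete. The incomes below are invented: swapping one of them for a single enormous income shifts the mean dramatically while the median does not move at all:

```python
# The median depends only on the middle position, so one extreme value
# barely affects it, while the mean shifts dramatically. Incomes are made up.
import statistics

incomes = [40_000, 50_000, 60_000, 70_000, 80_000]
incomes_with_outlier = incomes[:-1] + [100_000_000]   # swap in one huge income

print(statistics.mean(incomes), statistics.median(incomes))                            # 60000 60000
print(statistics.mean(incomes_with_outlier), statistics.median(incomes_with_outlier))  # 20044000.0 60000
```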

Outliers

Going back to our previous example, an outlier in our data set would be the income of Bill Gates, since outliers are values that are very distant from the rest of the values in our data set. Here the median remains valid, as does the trimmed mean. Outliers are often the result of data errors (for instance, mixing up kilometers with meters) and are usually worthy of further investigation.

Example of Estimates of Locations with Python

#Importing Libraries:

import pandas as pd
import numpy as np
from scipy.stats import trim_mean

#Reading Data

>>> state = pd.read_csv('/content/state.csv')
>>> print(state.head(8))

#Output

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE

Now we compute the mean, trimmed mean, and median for Population. For the mean and median we can use the pandas methods of the DataFrame; the trimmed mean requires the trim_mean function from scipy.stats.

Mean

>>> print(state['Population'].mean())
6162876.3

Median

>>> print(state['Population'].median())
4436369.5

Trimmed Mean

>>> print(trim_mean(state['Population'], 0.1))
4783697.125


Weighted Mean

The weighted mean is available in NumPy via np.average.

>>> print(state['Murder.Rate'].mean())
4.066

>>> print(np.average(state['Murder.Rate'], weights=state['Population']))
4.445833981123393


The mean is bigger than the trimmed mean, which is bigger than the median. This is because the trimmed mean excludes the five largest and five smallest states (trim=0.1 drops 10% from each end of the 50 states). If we want to compute the average murder rate for the country, we need to use a weighted mean or median to account for the different populations of the states.

Reference: This is a summary of "Practical Statistics for Data Scientists" by Peter Bruce, Andrew Bruce & Peter Gedeck, 2nd Edition, Chapter 1, Estimates of Location.


