Feature Scaling with Percentiles Instead of Sample Deviation

I am writing a series of lectures on AI for my son and his university friends, and I had a hard time finding one of my favorite tricks online -- a trick that helps wrangle data containing garbles (wildly corrupted values).

In essence, the trick is this -- perform feature standardization, but use the median in place of the sample mean and percentiles in place of the sample standard deviation.

Normalizing Data

Check out this set of 2000 synthetic data points:

[Figure: scatter plot of the 2000 synthetic data points]

Obviously this data needs to be scaled, as the x and y scales differ by three orders of magnitude. Viewed to scale, the y-axis variation dominates the x-axis variation, producing vertical smears:

[Figure: the same data plotted to scale; the y-axis variation reduces the clusters to vertical smears]

Let's take a moment to agree that this representation distorts the data. In this representation, two-dimensional Gaussian clusters have been reduced to one dimension. No matter what we intend to do next, it will be less effective, because we have thrown away information. Also, we are penalizing a particular choice of units, which should never matter.

A canonical scaling technique, sometimes called normalizing the data, translates and scales the data to have sample mean zero and sample standard deviation one in each coordinate:

[Figure: the data after scaling to mean zero and standard deviation one in each coordinate]

This scaling has several advantages, including the fact that imputing missing data to the mean is simply a matter of replacing missing coordinates with zero.
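
For concreteness, here is a minimal sketch of that standardization in Python with NumPy. The data below is made up for illustration and is not the dataset plotted above.

import numpy as np

def standardize(X):
    # Translate each column to sample mean 0 and scale to sample standard deviation 1.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # ddof=1 gives the sample standard deviation
    return (X - mean) / std

# Illustrative data: x on the order of thousands, y on the order of ones.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(5000, 800, size=2000),
                     rng.normal(3, 0.5, size=2000)])
X_scaled = standardize(X)  # each column now has mean ~0 and standard deviation ~1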

The Effect of a Garble on Normalization

Consider what happens if a garble occurs -- say, one data point is replaced by a random 32-bit integer. The single garbled data point is huge by comparison, and it dominates the scaling. All of the real data points are crammed into the lower left of this graph:

[Figure: normalized data with one garbled point; the real points are crammed into the lower left]

More to the point, if we zoom in on the actual data points, all of them are now negative, to counterbalance the garbled point. Worse, none of the horizontal spreading we promised ourselves has occurred. Because the garbled point has dominated both the mean and the standard deviation, the data points have once again been reduced to one-dimensional smears:

[Figure: zoomed view of the real points after normalizing with the garble included]
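
A rough numerical sketch of the effect (again with made-up illustrative values): one garbled 32-bit integer is enough to swamp both statistics.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(3, 0.5, size=2000)   # a well-behaved column
y_garbled = y.copy()
y_garbled[0] = 2**31 - 1            # one value replaced by a garbled 32-bit integer

print(y.mean(), y.std(ddof=1))                  # roughly 3 and 0.5
print(y_garbled.mean(), y_garbled.std(ddof=1))  # roughly 1e6 and 5e7 -- the garble dominates both

Divided by a standard deviation in the tens of millions, every real point lands in a sliver of width around 1e-7 just below zero, which is exactly the collapse visible in the zoomed plot.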

Substituting Median and Percentiles

I have seen recommendations to treat this situation by using anomaly detection to delete outliers, but I have an alternative suggestion -- replace the mean and standard deviation calculations with statistics that are more robust against outliers.

Let's first consider the perfect situation. If the data is symmetric, the mean (x̄) and the median (Median(x)) coincide. In Gaussian data, 68.27% of the distribution lies within one standard deviation (sₓ) of the mean. Put another way, x̄-sₓ is approximately the 15.865th percentile (X(0.15865) ≈ x̄-sₓ) and x̄+sₓ is approximately the 84.135th percentile (X(0.84135) ≈ x̄+sₓ). Thus, translating by the mean and scaling by the standard deviation should be roughly equivalent to translating by the median and scaling by half the spread between those two percentiles. Even in our case, with multiple clusters, the results are notably similar:

[Figure: median/percentile scaling compared with mean/standard-deviation scaling on the clean data]
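
A minimal sketch of the substitution in Python with NumPy (the function name is mine; the percentiles are the ones described above):

import numpy as np

def robust_standardize(X):
    # Translate by the column medians and scale by half the distance between
    # the 15.865th and 84.135th percentiles. For Gaussian data this approximates
    # (x - mean) / std, but the median and percentiles barely move when a few
    # values are wildly corrupted.
    median = np.median(X, axis=0)
    lo, hi = np.percentile(X, [15.865, 84.135], axis=0)
    scale = (hi - lo) / 2.0   # roughly one standard deviation for Gaussian data
    return (X - median) / scale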

When the situation is imperfect, however, differences can arise. What happens when we add a garbled data point? Again, when a single data point is garbled, the actual data points cluster at the bottom left of the graph:

[Figure: median/percentile-scaled data with one garbled point]

When we zoom in on the actual data points, however, we see the benefit of not averaging in an enormous garble. Normal scaling with the garble pushed all the real data points to negative values and virtually eliminated one of the dimensions. Neither of those things occurs here:

[Figure: zoomed view of the real points after median/percentile scaling with the garble included]
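
Putting the two scalings side by side on illustrative garbled data makes the contrast concrete (the scaling functions are repeated here so the sketch runs on its own):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(5000, 800, size=2000),
                     rng.normal(3, 0.5, size=2000)])
X[0, 1] = 2**31 - 1   # inject one garbled point into the y column

def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def robust_standardize(X):
    median = np.median(X, axis=0)
    lo, hi = np.percentile(X, [15.865, 84.135], axis=0)
    return (X - median) / ((hi - lo) / 2.0)

real = slice(1, None)   # every point except the garbled one
z = standardize(X)[real, 1]
r = robust_standardize(X)[real, 1]
print(z.min(), z.max())   # both around -0.02: the real points are squashed into a negative sliver
print(r.min(), r.max())   # roughly -3 and 3: the real spread survives the garble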

Closing Thoughts

I find this method of scaling helpful, as it helps mitigate the effect of corrupted data. However, there is no substitute for data exploration. Looking for impossible values, plotting histograms, and visualizing data are all exceptionally valuable tools, any one of which could have caught this particular outlier. Still, it is sometimes helpful to have an alternative approach to scaling that is more resilient to corrupted data. This is one of mine.
