Feature Scaling with Percentiles Instead of Sample Deviation

I am writing a series of lectures on AI for my son and his university friends, and I had a hard time finding one of my favorite tricks online -- a trick that helps wrangle data containing garbles (wildly corrupted values).

In essence, the trick is this -- perform feature standardization, but use the median in place of the sample mean and percentiles in place of the sample standard deviation.

Normalizing Data

Check out this set of 2000 synthetic data points:

[Figure: scatter plot of the 2000 synthetic data points]

Obviously this data needs to be scaled, as the x and y scales differ by three orders of magnitude. Viewed to scale, the y-axis variation dominates the x-axis variation, producing vertical smears:

[Figure: the same data plotted to scale; the y-axis variation reduces the clusters to vertical smears]

Let's take a moment to agree that this representation distorts the data. In this representation, two-dimensional Gaussian clusters have been reduced to one dimension. No matter what we intend to do next, it will be less effective, because we have thrown away information. Also, we are penalizing a particular choice of units, which should never matter.

A canonical scaling technique, sometimes called normalizing the data, translates and scales the data to have sample mean zero and sample standard deviation one in each coordinate:

[Figure: the data after scaling to mean zero and standard deviation one in each coordinate]

This scaling has several advantages, including the fact that imputing missing data to the mean is simply a matter of replacing missing coordinates with zero.
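
For concreteness, here is a minimal sketch of that standardization in Python with NumPy. The data below is made up for illustration and is not the dataset plotted above.

import numpy as np

def standardize(X):
    # Translate each column to sample mean 0 and scale to sample standard deviation 1.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # ddof=1 gives the sample standard deviation
    return (X - mean) / std

# Illustrative data: x on the order of thousands, y on the order of ones.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(5000, 800, size=2000),
                     rng.normal(3, 0.5, size=2000)])
X_scaled = standardize(X)  # each column now has mean ~0 and standard deviation ~1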

The Effect of a Garble on Normalization

Consider what happens if a garble occurs -- say, one data point is replaced by a random 32-bit integer. The single garbled data point is huge by comparison, and it dominates the scaling. All of the real data points are crammed into the lower left of this graph:

[Figure: normalized data with one garbled point; the real points are crammed into the lower left]

More to the point, if we zoom in on the actual data points, all of them are now negative, to counterbalance the garbled point. Worse, none of the horizontal spreading we promised ourselves has occurred. Because the garbled point has dominated both the mean and the standard deviation, the data points have once again been reduced to one-dimensional smears:

[Figure: zoomed view of the real points after normalizing with the garble included]
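
A rough numerical sketch of the effect (again with made-up illustrative values): one garbled 32-bit integer is enough to swamp both statistics.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(3, 0.5, size=2000)   # a well-behaved column
y_garbled = y.copy()
y_garbled[0] = 2**31 - 1            # one value replaced by a garbled 32-bit integer

print(y.mean(), y.std(ddof=1))                  # roughly 3 and 0.5
print(y_garbled.mean(), y_garbled.std(ddof=1))  # roughly 1e6 and 5e7 -- the garble dominates both

Divided by a standard deviation in the tens of millions, every real point lands in a sliver of width around 1e-7 just below zero, which is exactly the collapse visible in the zoomed plot.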

Substituting Median and Percentiles

I have seen recommendations to treat this situation by using anomaly detection to delete outliers, but I have an alternative suggestion -- replace the mean and standard deviation calculations with statistics that are more robust against outliers.

Let's first consider the perfect situation. If the data is symmetric, the mean (x̄) and the median (Median(x)) coincide. In Gaussian data, 68.27% of the distribution lies within one standard deviation (sₓ) of the mean. Put another way, x̄-sₓ is approximately the 15.865th percentile (X(0.15865) ≈ x̄-sₓ) and x̄+sₓ is approximately the 84.135th percentile (X(0.84135) ≈ x̄+sₓ). Thus, translating by the mean and scaling by the standard deviation should be roughly equivalent to translating by the median and scaling by half the spread between those two percentiles. Even in our case, with multiple clusters, the results are notably similar:

[Figure: median/percentile scaling compared with mean/standard-deviation scaling on the clean data]
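
A minimal sketch of the substitution in Python with NumPy (the function name is mine; the percentiles are the ones described above):

import numpy as np

def robust_standardize(X):
    # Translate by the column medians and scale by half the distance between
    # the 15.865th and 84.135th percentiles. For Gaussian data this approximates
    # (x - mean) / std, but the median and percentiles barely move when a few
    # values are wildly corrupted.
    median = np.median(X, axis=0)
    lo, hi = np.percentile(X, [15.865, 84.135], axis=0)
    scale = (hi - lo) / 2.0   # roughly one standard deviation for Gaussian data
    return (X - median) / scale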

When the situation is imperfect, however, differences can arise. What happens when we add a garbled data point? Again, when a single data point is garbled, the actual data points cluster at the bottom left of the graph:

[Figure: median/percentile-scaled data with one garbled point]

When we zoom in on the actual data points, however, we see the benefit of not averaging in an enormous garble. Normal scaling with the garble pushed all the real data points to negative values and virtually eliminated one of the dimensions. Neither of those things occurs here:

[Figure: zoomed view of the real points after median/percentile scaling with the garble included]
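
Putting the two scalings side by side on illustrative garbled data makes the contrast concrete (the scaling functions are repeated here so the sketch runs on its own):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(5000, 800, size=2000),
                     rng.normal(3, 0.5, size=2000)])
X[0, 1] = 2**31 - 1   # inject one garbled point into the y column

def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def robust_standardize(X):
    median = np.median(X, axis=0)
    lo, hi = np.percentile(X, [15.865, 84.135], axis=0)
    return (X - median) / ((hi - lo) / 2.0)

real = slice(1, None)   # every point except the garbled one
z = standardize(X)[real, 1]
r = robust_standardize(X)[real, 1]
print(z.min(), z.max())   # both around -0.02: the real points are squashed into a negative sliver
print(r.min(), r.max())   # roughly -3 and 3: the real spread survives the garble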

Closing Thoughts

I find this method of scaling helpful, as it helps mitigate the effect of corrupted data. However, there is no substitute for data exploration. Looking for impossible values, plotting histograms, and visualizing data are all exceptionally valuable tools, any one of which could have caught this particular outlier. Still, it is sometimes helpful to have an alternative approach to scaling that is more resilient to corrupted data. This is one of mine.
