Overfitting

In statistics, the name given to the act of mistaking noise for a signal is overfitting.

Suppose that you’re some sort of petty criminal and I’m your boss. I deputize you to figure out a good method for picking combination locks of the sort you might find in a middle school—maybe we want to steal everybody’s lunch money. I want an approach that will give us a high probability of picking a lock anywhere and anytime. I give you three locks to practice on—a red one, a black one, and a blue one.

After experimenting with the locks for a few days, you come back and tell me that you’ve discovered a foolproof solution. If the lock is red, you say, the combination is 27-12-31. If it’s black, use the numbers 44-14-19. And if it’s blue, it’s 10-3-32.

I’d tell you that you’ve completely failed your mission. You’ve clearly figured out how to open these three particular locks. But you haven’t done anything to advance our theory of lock-picking—to give us some hope of picking locks when we don’t know the combination in advance. I’d have been interested in knowing, say, whether there was a good type of paper clip for picking these locks, or some sort of mechanical flaw we could exploit. Or, failing that, whether there was some trick to detect the combination: maybe certain types of numbers are used more often than others? You’ve given me an overly specific solution to a general problem. This is overfitting, and it leads to worse predictions.

(Or it would be like picking the Cavs or the Spurs to win the NBA championship in 2015 simply because a LeBron-led team or the Spurs had won the last three championships.)

The name overfitting comes from the way that statistical models are “fit” to match past observations. The fit can be too loose—this is called underfitting—in which case you will not be capturing as much of the signal as you could. Or it can be too tight—an overfit model—which means that you’re fitting the noise in the data rather than discovering its underlying structure. The latter error is much more common in practice.
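The loose-versus-tight fit described above can be made concrete with a small numerical sketch. The example below (illustrative only; the signal, noise level, and polynomial degrees are assumptions, not from the excerpt) fits polynomials of increasing degree to noisy samples of a known signal. The high-degree fit chases the noise: its error on the training points shrinks, while its error on fresh, held-out points from the same signal does not improve and typically worsens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underlying signal: y = sin(x). We only ever see noisy samples of it.
x_train = np.linspace(0, 3, 20)
x_test = np.linspace(0.05, 2.95, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = np.sin(x_test) + rng.normal(0, 0.3, x_test.size)

def poly_mse(degree):
    """Fit a degree-`degree` polynomial to the training points and
    return (training MSE, held-out MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for degree in (1, 3, 15):
    train_err, test_err = poly_mse(degree)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

Degree 1 underfits (it misses curvature the signal actually has), degree 3 is roughly right, and degree 15 overfits: with 16 coefficients and 20 points it can nearly memorize the training noise, which is exactly the “looks better on paper, performs worse in the real world” pattern the passage describes.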

Overfitting represents a double whammy: it makes our model look better on paper but perform worse in the real world. Because of the latter trait, an overfit model eventually will get its comeuppance if and when it is used to make real predictions. Because of the former, it may look superficially more impressive until then, claiming to make very accurate and newsworthy predictions and to represent an advance worth publishing in a journal or selling to a client, crowding out more honest models from the marketplace. But if the model is fitting noise, it has the potential to hurt the science.

From The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t by Nate Silver

To forecast a set of data points properly, it's necessary to first understand the factors that might influence the results. Fitting a model to the points alone, without that understanding, is futile.


More articles by Stephanie Bartruff
