Model fitting? You're doing it wrong!

The trusty weighted least squares fit is a piece of machinery that every data scientist should have in their toolbox. Often, however, we fail to properly appreciate its limitations and end up applying it in situations where its justification is dubious at best.

Take the simple case of fitting a straight line. What assumptions have you likely made about your dataset?

(1) The errors are normally distributed.

(2) The errors are well estimated.

(3) The errors in the y direction dominate, i.e. e_y >> e_x.

(4) The data do indeed come from a straight line.
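To make assumptions (1)-(3) concrete, here is a minimal weighted least squares sketch for a straight line in Python with NumPy. The numbers are made up and the variable names are illustrative, not taken from any real dataset:

```python
import numpy as np

# Illustrative data: y measurements with (assumed known) Gaussian errors e_y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
e_y = np.array([0.3, 0.3, 0.3, 0.3, 0.3])

# Weighted least squares for y = m*x + b:
# minimise sum_i ((y_i - m*x_i - b) / e_y_i)^2 via the normal equations.
w = 1.0 / e_y**2
A = np.vstack([x, np.ones_like(x)]).T        # design matrix with columns [x, 1]
cov = np.linalg.inv(A.T @ (w[:, None] * A))  # parameter covariance matrix
m, b = cov @ (A.T @ (w * y))                 # best-fit slope and intercept
print(m, b, np.sqrt(np.diag(cov)))           # values and their formal 1-sigma errors
```

This is only the right answer when those four assumptions hold.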

Now take another look at your dataset. While most of the points seem to lie along a straight line, inevitably there are a number of outliers lying way off the main distribution that are still going to contribute to your fit. So what do you do?

Easy: sigma clip first and then perform a linear least squares fit, right? Bish, bash, bosh. Not so fast! Sigma clipping is the process of iteratively fitting your model, each time removing N-sigma outliers, until the fit converges. One problem with this approach is that it does not optimize an objective function, i.e. there is no comparison test between any of the fits in the iterative sequence, so how do you know that the outcome is really the "best fit"? The short answer is: you don't! Furthermore, do we even really want to clip outliers? Might they be telling us something? Keeping our original set of assumptions in mind, let's consider why these outliers might have arisen.
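Before we do, here is roughly what that clipping loop looks like in practice; a sketch only, reusing the weighted least squares step from the snippet above, with an arbitrary threshold and iteration cap:

```python
import numpy as np

def sigma_clip_line_fit(x, y, e_y, nsigma=3.0, max_iter=10):
    """Illustrative sigma-clipped weighted least squares fit of y = m*x + b.

    Repeatedly fit, drop points more than nsigma error bars from the current
    line, and stop when no further points are removed.
    """
    keep = np.ones_like(x, dtype=bool)
    for _ in range(max_iter):
        w = 1.0 / e_y[keep]**2
        A = np.vstack([x[keep], np.ones(keep.sum())]).T
        m, b = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y[keep]))
        resid = (y - (m * x + b)) / e_y          # normalised residuals of ALL points
        new_keep = np.abs(resid) < nsigma
        if np.array_equal(new_keep, keep):       # converged: nothing more to clip
            break
        keep = new_keep
    return m, b, keep
```

Note that the loop has no single objective function: each iteration quietly changes the dataset it is fitting.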

Could the outliers have arisen from a long tail in the underlying error distribution, i.e. is it possible that the errors are non-Gaussian? In this case, our first assumption is violated.

Is it possible that the points are not outliers at all and you've simply underestimated their errors? In that case, the second assumption is violated.

Is it possible that these data points are well measured in the y direction but have large errors in the x direction? In that case, your third assumption is violated.

Is it possible that the data don't really come from a straight line? For example, are you neglecting some background process that generates your outliers? In that case, your fourth assumption is violated.

Often you'll have a dataset in which these assumptions are mildly violated, and a weighted least squares fit will give you results that are incorrect but look satisfactory. The problem is that you can't tell. If your dataset does have large outliers, then at least one of these assumptions is not true and a weighted least squares fit is not for you! Instead, you need to write down an objective function that better describes the data.

Astronomer Prof. David Hogg advocates "mixture models" for this: assume that your dataset is drawn from a probability distribution that is the sum of two or more independent probability distributions, one from which you draw data that lie on a straight line (optionally with errors in both dimensions), and the others from which you draw the outliers.
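In the two-component version Hogg describes (written up in Hogg, Bovy & Lang 2010, "Data analysis recipes: Fitting a model to data"), the objective function might look something like the sketch below. The parameter names and the single broad-Gaussian background are illustrative choices: P_b is the probability that any given point is an outlier, and the background component has mean Y_b and variance V_b.

```python
import numpy as np

def log_likelihood(theta, x, y, e_y):
    """Mixture-model objective: each point is drawn from the straight line
    with probability (1 - P_b), or from a broad Gaussian "background" of
    outliers with probability P_b."""
    m, b, P_b, Y_b, lnV_b = theta
    V_b = np.exp(lnV_b)                      # work with the log so V_b stays positive
    # Foreground: the straight line with the reported Gaussian errors.
    fg = ((1.0 - P_b) / np.sqrt(2 * np.pi * e_y**2)
          * np.exp(-0.5 * (y - (m * x + b))**2 / e_y**2))
    # Background: one broad Gaussian that soaks up the outliers.
    bg = (P_b / np.sqrt(2 * np.pi * (e_y**2 + V_b))
          * np.exp(-0.5 * (y - Y_b)**2 / (e_y**2 + V_b)))
    return np.sum(np.log(fg + bg))
```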

This probably sounds like a bad idea, because it means swapping your original two-dimensional linear least squares problem for a much more formidable non-linear, higher-dimensional one. Fortunately, as Hogg explains in his one-hour YouTube lecture, Markov Chain Monte Carlo (MCMC) techniques make light work of such problems. I've linked parts 1 to 7 of these videos below. This is essential viewing for anyone wanting to start fitting their models properly. To get you started I am also throwing in a link to this "seriously kick-ass MCMC" sampler in Python, which really does make MCMC a breeze!
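To show how little extra code the sampling step actually takes, here is a rough sketch of exploring that mixture-model posterior with emcee (assuming emcee 3's EnsembleSampler interface, deliberately loose flat priors, and the x, y, e_y arrays and log_likelihood function from the sketches above):

```python
import numpy as np
import emcee

def log_prior(theta):
    m, b, P_b, Y_b, lnV_b = theta
    # Flat priors over a broad range; 0 < P_b < 1 keeps the mixture weights valid.
    if 0.0 < P_b < 1.0 and -10.0 < lnV_b < 10.0:
        return 0.0
    return -np.inf

def log_posterior(theta, x, y, e_y):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, x, y, e_y)   # log_likelihood from the mixture sketch above

ndim, nwalkers = 5, 32
guess = np.array([1.0, 0.0, 0.1, np.mean(y), 0.0])      # rough start for (m, b, P_b, Y_b, lnV_b)
p0 = guess + 1e-3 * np.random.randn(nwalkers, ndim)     # small ball of walkers around the guess
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=(x, y, e_y))
sampler.run_mcmc(p0, 5000)
samples = sampler.get_chain(discard=1000, flat=True)    # posterior samples after burn-in
```

The marginal posterior for (m, b) then comes with the outliers modelled rather than thrown away.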

http://dfm.io/emcee/current/ You're welcome!

