R-Squared Abuse (and How to Avoid It)
R^2 is the most well-known regression metric and often a misused one. This misuse results in wasted effort, inferior decisions, and at times regulatory findings. In this note I will survey R^2 abuse, its symptoms, and how to avoid it.
There are two common situations that reveal R^2 misuse: when the metric is very low and when it is very close to 1. First, when it is less than 0.5, many modelers and validators fret that the model is bad or worthless, even when all available factors have been considered. The result is often time wasted mining for some additional explanatory variable, or manipulating the equations and data to produce a higher value, as described in the second half of this note.
For example, in a stress testing exercise, regressing the rate of change of a product’s outstanding balance against the available economic variables may yield a rather low R^2. But R^2 merely measures the fraction of the variance of the balance changes that is explained by the non-random factors in the model. A low R^2 indicates that the influence of economic activity on the growth/decay rate of that product is rather small, not that the model is bad.
The low R^2 could be a sign that the best model is a constant growth/decay rate, which is itself a useful finding. To choose between models, one should consult an information criterion such as the Akaike Information Criterion (AIC) or the Bayes-Schwarz Information Criterion (BIC or SBC) to identify the model that is most likely to be the “best” in a theoretical sense [1]. One can then assess whether the improvement justifies the higher operational cost and complexity of the additional terms. Documentation showing the underlying reasons for the decisions, the modeling process, and an understanding of the results will enable regulators to get comfortable with the model regardless of its R^2.
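A minimal sketch of such a comparison, assuming Python with numpy and entirely synthetic data (the growth rates, the factor, and all coefficients are hypothetical; AIC is computed up to an additive constant common to both models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly growth rates: mostly a constant drift plus noise,
# with a modest link to an assumed economic factor x.
n = 120
x = rng.normal(size=n)
growth = 0.02 + 0.01 * x + rng.normal(scale=0.02, size=n)

def ols_aic(y, X):
    """Fit OLS and return the AIC under Gaussian errors.

    AIC = n * ln(SSE / n) + 2k, dropping constants shared by both models.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    n_obs, k = X.shape
    return n_obs * np.log(sse / n_obs) + 2 * k

ones = np.ones((n, 1))
aic_const = ols_aic(growth, ones)                       # constant-rate model
aic_econ = ols_aic(growth, np.column_stack([ones, x]))  # add the factor

print(f"AIC constant: {aic_const:.1f}, AIC with factor: {aic_econ:.1f}")
```

The lower AIC identifies the preferred model; whether the improvement justifies the extra term remains a judgment call about cost and complexity, as above.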
Another case where a low R^2 is common is large, disaggregated data sets, such as those using account-level data. Aggregating the data before analysis will make the R^2 appear higher by averaging out random noise. In the best case this offers faster calibration times but no improvement to the model; in the worst case the loss of information produces an inferior model with a better-looking R^2.
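The effect is easy to reproduce on a simulated panel (all numbers hypothetical): the same underlying relationship yields a tiny R^2 at the account level and a large one after averaging across accounts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical panel: 500 accounts observed over 60 months. Each account's
# outcome has a modest link to a shared factor plus large idiosyncratic noise.
n_accounts, n_months = 500, 60
factor = rng.normal(size=n_months)               # one value per month
noise = rng.normal(scale=3.0, size=(n_accounts, n_months))
y = 0.5 * factor + noise                         # account-level outcomes

def r_squared(y, x):
    """R^2 of a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Account level: idiosyncratic noise dominates, so R^2 is small.
r2_account = r_squared(y.ravel(), np.tile(factor, n_accounts))
# Aggregated: averaging across accounts cancels the noise, inflating R^2.
r2_aggregate = r_squared(y.mean(axis=0), factor)

print(f"account-level R^2: {r2_account:.3f}, aggregated R^2: {r2_aggregate:.3f}")
```

Nothing about the relationship changed between the two regressions; only the averaging did.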
Very high R^2 in many economic models may be an artifact that gives some stakeholders a dangerous and false sense of confidence. One common situation occurs when modeling the time dependence of a product balance: the modeler writes the equation with the product balance as a function of the prior month’s balance and a few macroeconomic variables. This is the “lagged dependent variable” formulation.
Because each month’s balance strongly influences the next month’s, the R^2 is extremely high, often 0.95 or more. This overstates the long-term predictive power: the forecast for a balance 24 months out will not use the prior month’s actual value, but a forecast that accumulates 23 months of residuals.
Furthermore, many misinterpret the small coefficients on the independent variables as a lack of sensitivity and blame this on the “lagged dependent variable”. The high R^2 may initially generate unreasonable confidence, but later oversight groups or regulators may dismiss the model as not credible.
In such situations one should consider formulating the model in terms of balance changes. The balance predictions of this alternate formulation will be identical; however, the model more clearly shows the relative contributions of the random effects and the factors used.
The graph at the top of this article shows a hypothetical situation where the balance (blue) depends on some economic variable (red) but adjusts only slowly to changes in that variable. The same relationship can be written in two forms. One expresses the balance as a function of its prior-month value and the economic factor; this produces an R^2 of 0.98. The other expresses the balance change as a function of the prior-month balance and the economic factor; it produces an R^2 of 0.3.
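The two forms can be reproduced on simulated data. The sketch below assumes a hypothetical partial-adjustment process (all coefficients illustrative, so the exact R^2 values will differ from the 0.98/0.3 above): because the prior-month balance is a regressor in both forms, the fitted residuals are identical, yet the two R^2 values diverge sharply.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical partial-adjustment process: the balance tracks an economic
# factor but closes only ~5% of the gap each month.
n = 240
x = np.cumsum(rng.normal(size=n)) * 0.1   # slowly moving economic factor
b = np.zeros(n)
for t in range(1, n):
    b[t] = 0.95 * b[t - 1] + 0.5 * x[t] + rng.normal(scale=0.2)

def r_squared(y, regressors):
    """R^2 of OLS with intercept; regressors is a list of 1-D arrays."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Levels form: B_t on B_{t-1} and x_t -- R^2 is dominated by persistence.
r2_levels = r_squared(b[1:], [b[:-1], x[1:]])
# Changes form: (B_t - B_{t-1}) on B_{t-1} and x_t -- same fitted balances,
# but R^2 now reflects how much of the month-to-month change is explained.
r2_changes = r_squared(np.diff(b), [b[:-1], x[1:]])

print(f"levels R^2: {r2_levels:.3f}, changes R^2: {r2_changes:.3f}")
```

Shifting the coefficient on the lagged balance by 1 converts one fit into the other, which is why the predictions are identical while the R^2 is not.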
Furthermore, in both cases the coefficient on the economic factor will be much lower than the long-range sensitivity; however, there is nothing wrong with the model or its projections. A simple scaling transformation of the coefficients yields what most people mean by the sensitivity.
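Concretely, for an assumed one-lag form B_t = a + rho * B_{t-1} + beta * x_t + eps_t, setting B_t = B_{t-1} in steady state gives B* = (a + beta * x) / (1 - rho), so the long-run sensitivity to x is beta / (1 - rho). A tiny illustration with made-up fitted coefficients:

```python
# Assumed one-lag form: B_t = a + rho * B_{t-1} + beta * x_t + eps_t.
# In steady state (B_t = B_{t-1}), the long-run sensitivity of the
# balance to x is beta / (1 - rho), not beta itself.
rho, beta = 0.95, 0.5          # illustrative fitted coefficients
long_run = beta / (1 - rho)
print(f"short-run: {beta}, long-run: {long_run:.1f}")   # 0.5 vs 10.0
```

A small monthly coefficient is thus entirely consistent with a large cumulative sensitivity when the adjustment is slow.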
If the decision is to stick with the balance formulation, avoid prominent displays of the high R^2 and ensure the presentations and documentation focus on other metrics rather than the R^2. Furthermore, explicitly calculating and showing the long-term sensitivity coefficient will give users a better understanding of the relationship and avoid confusion or fears that the model understates the impact of the economy on the balance.
The second cause of unreasonably high R^2 is artifacts of the data-gathering process. Perhaps many values in a time series were replaced by a provider using a linear equation, or by an interpolation involving the underlying factors. Such a situation can occur in regressions involving various interest rates where some of the data comes from quotes that are estimates rather than direct observations.
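A simulated illustration of this artifact (the series, the fill rule, and the 5-in-6 missing rate are all hypothetical, not a claim about any actual provider): when missing quotes are filled with estimates derived from the very factor later used in the regression, the R^2 against that factor is inflated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical rate series, weakly related to a factor x plus genuine noise.
n = 300
x = np.cumsum(rng.normal(size=n)) * 0.2
y_true = 0.5 * x + rng.normal(scale=2.0, size=n)

def r_squared(y, x):
    """R^2 of a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Suppose 5 of every 6 quotes are missing and the provider fills them with
# estimates derived from the factor itself (hypothetical fill rule 0.5 * x).
y_filled = y_true.copy()
missing = np.ones(n, dtype=bool)
missing[::6] = False
y_filled[missing] = 0.5 * x[missing]

r2_genuine = r_squared(y_true, x)    # the relationship as it actually is
r2_filled = r_squared(y_filled, x)   # inflated by the model-based fill

print(f"genuine R^2: {r2_genuine:.2f}, filled R^2: {r2_filled:.2f}")
```

The filled points fit the factor perfectly by construction, so the apparent explanatory power is an artifact of the fill, not of the economics.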
The third cause of either unreasonably low or high R^2 is sample choice or even cherry-picking of data. An extreme example is the exclusion of data that refutes the hypothesis.
A more common challenge is that R^2 depends on the variation of the independent variables in the sample. A sample over which they vary only a little will produce a low R^2, because they cannot contribute much to the total variability of the dependent variable; a sample in which these factors show the greatest dispersion will exaggerate the R^2. One should address these situations by understanding the process by which the data was collected and re-evaluating the approach to collecting and cleaning the data.
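This dependence on sample dispersion can be demonstrated with one fixed relationship and three sampling windows (the slope, noise level, and cutoffs below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# One true relationship y = 2x + noise everywhere; only the sampling differs.
n = 2000
x = rng.uniform(-5, 5, size=n)
y = 2 * x + rng.normal(scale=3.0, size=n)

def r_squared(y, x):
    """R^2 of a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

narrow = np.abs(x) < 0.5        # window where x barely varies
wide = np.abs(x) > 4.0          # window where x is most dispersed

r2_narrow = r_squared(y[narrow], x[narrow])
r2_full = r_squared(y, x)
r2_wide = r_squared(y[wide], x[wide])

print(f"narrow: {r2_narrow:.3f}, full: {r2_full:.3f}, wide: {r2_wide:.3f}")
```

The slope estimate is roughly the same in all three regressions; only the share of total variability attributable to x changes with the sampling window.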
R^2 is often a useful and informative metric, but one should avoid treating R^2 as a measure of model or modeler quality. Instead, modelers and model consumers should compare the proposed model against alternatives, including the existing model. These comparisons should focus on measures of expected prediction error over the relevant horizon.
[1] There are limitations to the use of information criteria that I am omitting for simplicity.
Agree with R^2 being misused - especially when model/residual diagnostics haven't been performed (to establish 'validity' of the model). Another common error is to hand-formulate a regression formula (without fitting by least squares) and calculate R^2 as SSR/SST. This gives overly optimistic values, sometimes greater than 1. The right way (if there is one) to calculate R^2 for a hand-formulated formula is 1 - SSE/SST, which can never be greater than 1. I advise people to use Mean Absolute Percent Error (MAPE) as an easily understandable metric, compared to AIC or BIC.
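A quick numeric check of the commenter's point (the data and the deliberately too-steep hand-picked formula are invented; SSR is taken about the mean of y, one common convention): for a formula that was not fitted by least squares, SST = SSR + SSE no longer holds, so SSR/SST can exceed 1 while 1 - SSE/SST cannot.

```python
import numpy as np

rng = np.random.default_rng(5)

# A hand-specified (not least-squares-fitted) prediction formula.
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=100)
y_hat = 2.0 * x - 3.0            # hypothetical hand-picked formula, too steep

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # "explained" sum of squares
sse = np.sum((y - y_hat) ** 2)           # residual sum of squares

# For OLS fits SST = SSR + SSE, so the two formulas agree. For a hand-picked
# formula they do not: SSR/SST can exceed 1, while 1 - SSE/SST is at most 1
# (though it can go negative for a formula worse than predicting the mean).
print(f"SSR/SST     = {ssr / sst:.2f}")
print(f"1 - SSE/SST = {1 - sse / sst:.2f}")
```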