Be There Or B²
Is it a curse or a virtue? It is hard to say for sure. It certainly can irritate others. My wife, who spends the most time with me, is irritated the most. I get it because my son sometimes shows the same characteristic and I find myself getting irritated with him.
Oh, I have not told you what that is. I sometimes am excessively precise. I may interrupt my wife to get a particular detail correct, a correction that is really of no relevance to the conversation. The funny thing is that this applies only to some things; for others, I can have the opposite issue. Oh, the contradictions we all possess. Regardless, precision is what this blog is about.
I have a pet peeve when writing a protocol. I know it is pedantic, even irrelevant. But the things in the protocol should reflect precisely the purpose of each section. Take the objectives and endpoints section. What are these? Objectives are the questions you wish to answer. Endpoints are the things within people that you measure to answer those questions. When a protocol lists the physical examination as an endpoint (as is often written), I always say no, that is not an endpoint; it is a procedure. In fact, what do you note (or measure) during a physical examination? Adverse events. So the endpoint of adverse events already covers this procedure. You can mention physical exams in the procedures section, but not as an endpoint. Or maybe you will see that the endpoint is the event rate. How do you measure a rate in an individual? You do not. A rate is a statistic. It is fine to mention in the statistics section, but the endpoint is whether a subject has the event. That is the thing you are measuring.
There are other parts of the protocol with this sort of issue. But the key is that these mistakes (and they are made a lot, I mean a lot) have little importance at all. What I mean is that listing a statistic or a procedure as an endpoint will not in any way jeopardize the trial or harm a patient. I have always believed that if you got these things right, the protocol would read better for regulators and IRBs, but I do not even know this. What I know is that it is wrong, and that bothers me. When I am given a draft of a protocol with these inconsequential errors, I correct them. I suspect the original author feels somewhat like my wife does. At least it is correct.
This characteristic may ultimately be counterproductive sometimes; other times it is a virtue. Recently I read a post on LinkedIn explaining to the audience what R² was. It explained how R² can be interpreted as the proportion of the variance explained, etc. Basically, a benign and correct explanation with little intuitive appeal. But there was an interpretation section saying that when R² is less than 0, it means you predicted worse than the mean. Everything I know about R² says it is bounded below by 0 by construction. So I made a somewhat snarky comment, basically asking whether a correlation can now be an imaginary number.
In this blog I will be walking a tightrope: being mathematically precise while not confusing my non-mathematical readers. Let us remember the formula R² = 1 − SSE/SST. A sum of squares means that for each observation, one takes the observed value minus the predicted value, squares each of these differences, then adds them together. The higher the SS, the more error in the prediction. SSE is the error for the linear regression. SST is the error you would have if you simply predicted the mean value for everyone, ignoring the regression entirely. As can be seen, at best you have a perfect prediction from your linear regression, which is SSE = 0. Or, at worst, your linear regression prediction is no better than the mean, which is SSE = SST. Notice this bounds R² between 0 and 1. You should also see why this is the proportion of total error explained by the regression, since R² is equivalently (SST − SSE)/SST. One of the concepts here is that R² is just a comparison of the fit of the regression to the fit of the mean. Lastly, the thing that makes this work is construction: an ordinary least squares fit with an intercept can always do at least as well as the mean, so SST must be greater than or equal to SSE.
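To make the bound concrete, here is a minimal sketch (NumPy, with made-up data) of computing R² both ways:

```python
import numpy as np

# Made-up data for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

# Ordinary least squares fit: polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)      # error of the regression
sst = np.sum((y - y.mean()) ** 2)   # error of predicting the mean for everyone

r2 = 1 - sse / sst                  # equivalently (sst - sse) / sst
```

Because the least squares fit with an intercept can never do worse than the mean on the data it was fit to, SSE ≤ SST here, and `r2` necessarily lands between 0 and 1.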
So, this cannot be negative. I received replies insisting that it can be. What am I missing? Apparently, in the data science realm, the following problem exists: I fit my model on a training data set, I come up with a fixed model, and I ask how well it predicts on a validation data set. Thus, a new “R²” was invented for this purpose. Without becoming too mathematical, this “R²” compares the error of the prediction on the validation set to the error of the mean on either the training set or the validation set. It is computed in a way that looks like the R² formula and is called “R²”. The problem is that we no longer have, by construction, that one error is bounded by the other. Thus, not only can you have negative values, but values greater than 1 are possible. Is this a metric? Yes. Does it tell you about fit? Probably. Is it an R² with the same interpretation? No. It certainly does not represent the proportion of variance explained (it can be less than −11, for instance). Calling it R² can do nothing but confuse.
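To see how the out-of-sample version goes negative, here is a sketch (NumPy, made-up data) in which the relationship drifts between training and validation, so the fixed training model predicts worse than the validation mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data: positive slope
x_tr = rng.normal(size=50)
y_tr = 2.0 * x_tr + rng.normal(size=50)
slope, intercept = np.polyfit(x_tr, y_tr, 1)   # the fixed model

# Validation data from a drifted population: the relationship has reversed
x_va = rng.normal(size=50)
y_va = -2.0 * x_va + rng.normal(size=50)

pred = slope * x_va + intercept
sse_pred = np.sum((y_va - pred) ** 2)           # error of the fixed model
sst_va = np.sum((y_va - y_va.mean()) ** 2)      # error of the validation mean

# The out-of-sample "R^2": nothing forces sse_pred <= sst_va,
# so this comes out negative when the prediction is worse than the mean.
r2_oos = 1 - sse_pred / sst_va
```

There is no bound by construction here: the fixed model was optimized on a different data set, so its error on the validation data can exceed the error of the validation mean by any amount.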
At this point I recognize my criticism will land as a slap to those who use this metric. I am sorry about that, but just because it is used does not mean it should be. Or, if it is used, it should not be referred to as R². As a peace offering, let me give you a metric, which I do not believe has been proposed, that gives the spirit of R², if not its interpretation, as a measure of your prediction on a validation data set. Let me introduce you to B² (I get to name it; I am Brian, after all). First, on the validation data set, use ordinary least squares to estimate the regression and its resulting sum of squared errors. Call this SSEv. Notice this is the lowest error that a model of this structure can have on the validation data. Next, in the validation data, find the error of the prediction from the model fit on the training set. Call this SSEt. Notice that SSEv/SSEt is bounded by 0 and 1. The ratio approaches 0 as the prediction error (SSEt) approaches infinity; that is, the worse the prediction, the smaller the ratio. And the ratio is 1 if the prediction is in complete concordance with the validation data for a model of this complexity. This, I think, would be a valid and useful metric. So, B² = SSEv/SSEt.
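The recipe above can be sketched as follows (NumPy; `b_squared` is my name for a hypothetical helper, and the data are made up):

```python
import numpy as np

def b_squared(x_tr, y_tr, x_va, y_va):
    """B² = SSEv / SSEt for a simple linear model."""
    # The model fixed from the training data
    s_tr, b_tr = np.polyfit(x_tr, y_tr, 1)
    # The best model of the same structure, refit on the validation data
    s_va, b_va = np.polyfit(x_va, y_va, 1)

    # SSEt: error on the validation data of the training-set prediction
    sse_t = np.sum((y_va - (s_tr * x_va + b_tr)) ** 2)
    # SSEv: the lowest error any model of this structure can achieve there
    sse_v = np.sum((y_va - (s_va * x_va + b_va)) ** 2)
    return sse_v / sse_t

# Made-up example: training and validation drawn from the same population
rng = np.random.default_rng(2)
x_tr = rng.normal(size=100)
y_tr = 1.5 * x_tr + rng.normal(size=100)
x_va = rng.normal(size=100)
y_va = 1.5 * x_va + rng.normal(size=100)

b2 = b_squared(x_tr, y_tr, x_va, y_va)  # close to 1 when the model transfers well
```

Since the refit minimizes the error on the validation data itself, SSEv ≤ SSEt by construction, and B² stays between 0 and 1 no matter how badly the training model transfers.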
In the end, you see that my sometimes-excessive precision can lead to something unique (or potentially unique). Not just unique, however, but a metric that would have real value if implemented. So, in answer to my first question: it is a curse until it becomes a virtue.
Read other articles I have written: Blog 3 — StartersGateBioQuantAdvisory
Never miss a weekly post again. I’ve started an email list for my blog. Besides sending my blog to your e-mail, I’m also kicking off an “Ask Brian” column. If you have a general question about drug development or quantitative work, send it in. I’ll answer what I can, depending on how many come through. You’ll also receive the occasional Starter's Gate update, which will be low-frequency and not spammy. Join the list at E-mail Sign Up — Starter's Gate BioQuant Advisory
Btw, R^2 for prediction models most definitely can AND SHOULD be negative in certain situations. Most notably, when evaluating a model (no matter how it was originally trained) on a new population and there has been sufficient population drift the R^2 will be negative because miscalibration has been introduced. It is appropriate that R^2 be negative and worse than the mean prediction from that drifted population, in my opinion. This is not a theoretical issue, external validation across geography, population, and time is often a necessary part of prediction model validation and assessment. Drifts frequently occur so it’s very useful to have negative values for the accuracy metric. It is indeed unfortunate and confusing that this metric is called R^2, but that nomenclature choice is a consequence of a long history and now difficult to change.
This is all missing the important role that discrimination and calibration play in predictive model performance and the desirable property that the usual out of sample R^2 definition as an accuracy measure nicely and intuitively captures those concepts via R^2 = Discrimination Index - Miscalibration Index. Moreover Discrimination Index is the R^2 of a perfectly calibrated prediction model (or of the recalibrated prediction model if not perfectly calibrated to start with). At least to me, these are some very nice properties. But, I suppose, different strokes …
I've run into the "negative R^2" thing a few times and it always throws me. My preferred definition of R^2 is "the squared-correlation between a prediction and observed outcome". The virtue of this is it's easy to say in plain words, keeps the usual boundedness, but has both in and out of sample versions, with the additional bonus that typically in-sample R^2 is higher than out-of-sample R^2. I think it is the same as B^2?