Being wrong about efficacy probabilities

The Bayesian versus Frequentist confusion is not even the half of it

I was rather bemused by a recent discussion of probabilities on Twitter in which it was claimed that the sort of probabilistic statements that Bayesian statistics produces are easier to understand than frequentist ones. My experience is that this may or may not be so, but even if you were to replace frequentist confidence intervals (say) by Bayesian credible intervals, you would still find that many non-statisticians would struggle to understand what they conveyed.


Figure 1: 95% confidence intervals for the treatment effect for 60 simulations for a trial in asthma.

I take as exhibit one Figure 1, in which I have simulated the estimated mean treatment effect on forced expiratory volume in one second (FEV1), and the associated 95% confidence intervals, for 60 parallel group trials in asthma. I am the god of this simulation universe and I fixed the true treatment effect to be 300ml (represented by a dashed black line). You will see that 57 out of 60 intervals (the ones drawn in black, with point estimates given by a diamond) cover this value and three (the ones in red, with point estimates given by a circle) don't. So, 95% of the 95% confidence intervals cover the true value. It works pretty well, doesn't it?

At this point I have to make a confession. I changed the simulation seed until I got exactly three failed coverages. Probability statements are themselves probabilistic: 95% confidence is a long-run value, and sixty is a rather low number to count as the long run, so I hope that the reader will indulge my cheating in this way. I could have given you six million intervals, but a) life is short and b) the graph would have been rather crowded.
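
For readers who want to try this at home, here is a minimal sketch of the sort of simulation involved. It is not my original code, and the per-arm sample size, the within-arm standard deviation and the baseline FEV1 level are assumptions chosen only to make the arithmetic concrete; the true effect of 300ml and the 60 trials are as described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(20230101)   # seed chosen for illustration only
n_trials, n_per_arm = 60, 200           # 60 trials; per-arm size is an assumption
true_effect, sigma = 300.0, 450.0       # true effect 300 ml; within-arm SD assumed

covered = 0
for _ in range(n_trials):
    control = rng.normal(2200.0, sigma, n_per_arm)                # assumed baseline FEV1 (ml)
    treated = rng.normal(2200.0 + true_effect, sigma, n_per_arm)
    diff = treated.mean() - control.mean()
    sp2 = (treated.var(ddof=1) + control.var(ddof=1)) / 2         # pooled variance (equal arms)
    se = np.sqrt(2 * sp2 / n_per_arm)
    t_crit = stats.t.ppf(0.975, df=2 * n_per_arm - 2)
    lo, hi = diff - t_crit * se, diff + t_crit * se               # 95% confidence interval
    covered += (lo <= true_effect <= hi)

print(f"{covered} of {n_trials} intervals cover the true effect of {true_effect} ml")
```

Running it with different seeds will, of course, give coverage counts scattered around 57 out of 60, which is precisely the point about long-run behaviour made above.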

Be that as it may, any Bayesian worth their salt will be quick to point out that this is not useful. If we have only one trial, why should a long-run average interest us? To that I might reply, well, I could calculate some Bayesian credible intervals, based on an uninformative prior, that were identical to the confidence intervals. What's your problem?

That's all very well, the Bayesian might reply, but that is to assume that you know nothing. Suppose I know that in general treatment effects greater than 350ml are very implausible. This would give me the means of recognising that high estimates were unusual. Perhaps I would have identified your three failures as being such. To assume ignorance is often rather ignorant.
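
As a rough sketch of how such prior knowledge might be brought to bear, consider a conjugate normal-normal update. The prior mean of 200ml and prior standard deviation of 75ml are my own assumptions, chosen only so that effects much beyond 350ml receive little prior weight; they are not values used anywhere above.

```python
import numpy as np
from scipy import stats

# Illustrative prior: effects much beyond 350 ml are thought implausible.
prior_mean, prior_sd = 200.0, 75.0       # assumed prior, for illustration only

def credible_interval(estimate, se, level=0.95):
    """Conjugate normal-normal update of a trial estimate and its standard error."""
    prior_prec, data_prec = 1.0 / prior_sd**2, 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    z = stats.norm.ppf(0.5 + level / 2)
    return post_mean - z * np.sqrt(post_var), post_mean + z * np.sqrt(post_var)

# A suspiciously high frequentist estimate is pulled back towards the prior.
print(credible_interval(estimate=420.0, se=45.0))
```

The credible interval for an implausibly high estimate is shrunk towards the prior, which is the sense in which an informative Bayesian might flag some of the red intervals of Figure 1 as unusual.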

The lurking pachyderm

The elephant in the room, however, is this. I have not said what it is we are estimating and there are at least two distinct possibilities. The first is that I am estimating the treatment effect in the given trial. The second is that I am estimating the effect in the wider (admittedly rather ill-defined) 'population' of patients.

The first of these is often falsely assumed to be trivial, a matter of descriptive statistics only, and it is then equally falsely assumed that what statistical analysis delivers is a statement about the latter. However, in a parallel group trial with one-to-one randomisation, which is what my simulation assumed, half the patients get the experimental treatment and half the control. The true treatment effect is the unobservable difference that would obtain were all patients to be given both treatments on different identical occasions. As it is, I cannot observe this and have to estimate it with half the values missing. My population here is not the population of all patients but the population of all randomisations.
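
To make the idea of a population of randomisations concrete, here is a small sketch (my own, and purely illustrative) that holds the observed outcomes fixed and re-allocates them to the two arms over and over again; the resulting distribution of mean differences is the reference set against which the observed difference can be judged. The FEV1 changes used are invented.

```python
import numpy as np

rng = np.random.default_rng(7)   # seed for illustration only

def randomisation_distribution(treated, control, n_rerand=10_000):
    """Pool the observed outcomes, re-split them at random into arms of the
    original sizes, and record the treated-minus-control mean difference each time."""
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    diffs = np.empty(n_rerand)
    for i in range(n_rerand):
        perm = rng.permutation(pooled)
        diffs[i] = perm[:n_t].mean() - perm[n_t:].mean()
    return diffs

# Invented FEV1 changes (ml), for illustration only
treated = np.array([310.0, 280.0, 450.0, 260.0, 330.0, 390.0])
control = np.array([  0.0,  60.0, -40.0, 120.0,  30.0, -10.0])
observed = treated.mean() - control.mean()
dist = randomisation_distribution(treated, control)
p_value = np.mean(np.abs(dist) >= abs(observed))   # two-sided randomisation p-value
print(observed, p_value)
```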

In fact, in the simulation, I set the true effect to be nearly constant from trial to trial. That being so, it really doesn't matter which question I answer. The variation from trial to trial is minimal. Suppose, however, that I now make this variation considerable. Figure 2 now shows the coverage of my confidence intervals for the average population effect.


Figure 2: 95% confidence intervals for the treatment effect for 60 simulations for a trial in asthma. The trial to trial variation for the true treatment effect is now considerable.

It can be seen that I am now doing rather badly. The problem is that all I have with which to estimate the variability is the within-trial variation in results. This does not reflect the fact that the true treatment effect itself varies from trial to trial.

Rescuing the situation

How can I rescue this situation? Figure 3 shows how. All I need to claim is that what I am estimating is the treatment effect in the trial I actually ran. In the figure I no longer compare the treatment estimate to the global 'true' average effect (given by the dashed black line) but to the local true effects (given by the blue bars). You will now see that 57 out of 60 intervals do, indeed, cover the local true value.


Figure 3: The same 95% confidence intervals for the treatment effects as given in Figure 2. The intervals are now compared to the local true treatment effect.
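
The situation of Figures 2 and 3 can be imitated by extending the earlier sketch: each trial's true effect is now drawn about the 300ml average, with a between-trial standard deviation of 150ml that is my own choice and not a value used for the figures. Each interval is then checked twice, once against the global average (the comparison of Figure 2) and once against the trial's own local truth (the comparison of Figure 3); the former typically falls well short of 95% coverage while the latter stays close to it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)           # seed for illustration only
n_trials, n_per_arm = 60, 200             # per-arm size assumed, as before
mean_effect, sigma = 300.0, 450.0         # 300 ml average effect; within-arm SD assumed
between_trial_sd = 150.0                  # assumed between-trial variation in the true effect

cover_global = cover_local = 0
for _ in range(n_trials):
    local_effect = rng.normal(mean_effect, between_trial_sd)      # this trial's own true effect
    control = rng.normal(2200.0, sigma, n_per_arm)
    treated = rng.normal(2200.0 + local_effect, sigma, n_per_arm)
    diff = treated.mean() - control.mean()
    sp2 = (treated.var(ddof=1) + control.var(ddof=1)) / 2
    se = np.sqrt(2 * sp2 / n_per_arm)
    t_crit = stats.t.ppf(0.975, df=2 * n_per_arm - 2)
    lo, hi = diff - t_crit * se, diff + t_crit * se
    cover_global += (lo <= mean_effect <= hi)     # the comparison made in Figure 2
    cover_local += (lo <= local_effect <= hi)     # the comparison made in Figure 3

print(f"cover the global average: {cover_global}/{n_trials}")
print(f"cover the local truth:    {cover_local}/{n_trials}")
```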

The blue whale in the room

You may not like what I have done. You may claim that it is a cheat delivering an irrelevant answer that lacks ambition. That's up to you. Please tell me what your relevant solution is. If you think it's all solved by Bayes, I think that you are sadly mistaken, but in any case dealing with the elephant is a minor issue. There is actually a much larger mammal crowding the scene. What can we say about the effect on individual patients?

I won't attempt to deal with that now, but watch this space.

Great article! Apologies in advance for the long comment. Would performing inference only on the N subjects pose a challenge when interpreting variability? A population of randomizations could be hardest of all for clinicians and non-statisticians to appreciate and might mean we are limited to randomization tests. If we hold the randomization fixed and use a common testing procedure, how do we interpret the sample-to-sample variability in our observations? Do repeated experiments represent hypothetical panel data for the same N subjects? I don't think that is an appropriate interpretation. Do repeated experiments represent rewinding the hands of time on each experiment yet inexplicably getting different results in each experiment for the same N subjects? That is not palatable either. In either case, if there is no estimate of within-subject variability, the between-subject variability represents measurement error, which may be inconceivably large. If there is an estimate of within-subject variability, then we would need to be certain to constrain the between-subject variability to zero during inference, since the target patient population is the N subjects in the trial.

I think the best interpretation is that we are making inference on a broader patient population from which the study subjects were randomly selected. The between-subject variability represents randomly selecting different subjects in each repeated experiment from the population. If multiple studies appear to have different local truths we can conclude that: i) trial conduct was part of the intervention, so that each study investigated a different intervention for the same target patient population; or ii) each study used a biased sample from the target patient population; or iii) each study sampled unbiasedly from different patient populations. This would still allow you to create the forest plot with the confidence intervals covering local true effects.
