Beware of the Morlocks:
The Sphinx in The Time Machine

If you use a Bayesian Time Machine you may be in for some surprises

“It sounds plausible enough tonight,” said the Medical Man; “but wait until tomorrow. Wait for the common sense of the morning.”

HG Wells, The Time Machine, Ch II

Guide Rail

The recent FDA Guidance [1] on using Bayesian methods has attracted much interest and examples of an enthusiastic welcome are not hard to find. But as I often say:

At any celebration, the role of a statistician, especially a frequentist one, is often that of a bad fairy. Nobody invited them, they turn up at the end, spread gloom and send everyone to sleep.

So I am going to play the role of a bad fairy here and rail against the guidance. I am less than enamoured of its recommendations, not because it is Bayesian per se but because, in its enthusiasm for Bayesian methods, it is in danger of promoting debased Bayes and possibly even debayesed Bayes.

Praise for Bayes

But first, let me quote some praise.

If the FDA follows through with the proposed guidelines, and they are not fatally twisted by pressure from the medical establishment and health care industry, it should bring fresh air and sunlight into the approval process. It should save money and speed innovation, with better health outcomes.

Aaron Brown. How To Speed Up the Search for Cures Through a Change in Probability Theory.

“Bayesian methodologies help address two of the biggest problems of drug development: high costs and long timelines,” said FDA Commissioner Dr. Marty Makary in a press release announcing the draft guidance. “Providing clarity around modern statistical methods will help sponsors bring more cures and meaningful treatments to patients faster and more affordably.”

PharmaVoice, New FDA guidance that’s a ‘huge deal’ for clinical trials: Why using Bayesian statistics could transform trial design for rare diseases and beyond.

Richard Lilford, professor of public health at the University of Birmingham, UK, has long called for greater adoption of bayesian approaches, such as in drug development for rare diseases, and was excited by the new guidance. “It’s good that after years of prompting, a decision body has decided to accept ‘grown up’ statistics,” Lilford told The BMJ …

Peter Doshi, BMJ 2026;392:s180

Grown Up or Blown Up?

Grown up statistics? I am not so sure. In my opinion, the issue is not so much Bayesian versus frequentist statistics as a number of matters related to concurrent control.

The classical clinical trial uses concurrent control, is randomised and double blind. It is not always appreciated that the degree of guaranteed blinding is constrained by the degree to which the random sequence used can be guessed[2]. Furthermore, blinding, which is often taken as a way of dealing with patient expectations, has a side-effect that is frequently overlooked. It makes everything else also random. See Blind Date.

I am going to illustrate the problems that can arise by considering what happens once you abandon concurrent control. The particular context is that of adaptive designs. I have been a member of at least two data safety monitoring boards in which allocation ratios were varied and I found this a challenging experience. I am going to illustrate why adopting a so-called Bayesian Time Machine will not necessarily eliminate the problems with concurrent control that adaptive designs may create. My starting point is the paper by Saville et al that the FDA guidance cites[3].

Time's chariot

But at my back I always hear/ Time’s wingèd chariot hurrying near;

Andrew Marvell, To His Coy Mistress

Figure 1 below is based on a 2022 paper by Saville et al[3]. It is a hypothetical example in which arms are added or dropped over time in an adaptive design. Initially, 50 patients are allocated to Arm 1 and 50 to Control. In period 2, 50 further patients are allocated to each arm. In period 3, 33 patients are added to each of Arm 1 and Control, but a new treatment, Arm 2, is now included in the trial and 33 patients are allocated to it, and so forth.

Figure 1. Hypothetical allocation of patient numbers over 10 periods to five treatment arms and a control arm in an adaptive design. Based on Saville et al (2022)

If, in such a trial, you wish to eliminate a time trend, the classic frequentist way is to make sure that any treatment estimate is constructed from contrasts that respect concurrent control. Thus, you cannot compare Arm 3 and Arm 1 directly, since there is no time period in which both are given. However, if Control can be regarded as being the same at all times (a point that will be examined in due course), then you could compare the difference of Arm 3 to Control in period 5 with the difference of Arm 1 to Control in periods 1 to 4 in order to obtain an indirect comparison of Arm 3 and Arm 1. This is the sort of thing that has a long history in incomplete block designs and a more recent one in network meta-analysis.
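The arithmetic of such an indirect comparison can be sketched briefly. The patient numbers below are hypothetical, not those of the Saville et al design, and the calculation assumes independent group means with a common residual variance sigma squared.

```python
# Hypothetical numbers: indirect comparison of Arm 3 with Arm 1 via Control.

n_arm1, n_ctrl_early = 133, 133   # Arm 1 and its concurrent Control (hypothetical)
n_arm3, n_ctrl_late = 33, 33      # Arm 3 and its concurrent Control (hypothetical)

# Direct contrasts, each with variance multiplier 1/n_treatment + 1/n_control
vm_arm1_vs_ctrl = 1 / n_arm1 + 1 / n_ctrl_early
vm_arm3_vs_ctrl = 1 / n_arm3 + 1 / n_ctrl_late

# Indirect comparison (Arm 3 - Control) - (Arm 1 - Control): the two direct
# contrasts involve disjoint patients, so their multipliers simply add.
vm_indirect = vm_arm3_vs_ctrl + vm_arm1_vs_ctrl
print(round(vm_indirect, 4))  # -> 0.0756
```

Note that the indirect contrast pays for both direct comparisons at once: its variance multiplier is the sum of the two, which is why such comparisons are costly relative to concurrent ones.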

Saville et al call this Time Categorical Analysis. Figure 2 below shows variance multipliers (what you would have to multiply the residual mean square by in order to get the variances of estimated effects) for the contrasts of each of the five arms compared with Control, assuming a linear model applied to a continuous outcome. Two models are used. The first ignores time period and the second adjusts for it, treating time as levels of a categorical factor, as suggested by Saville et al.

Figure 2. Variance multipliers for five treatments compared to control using two different models. Dashed lines: period effects are ignored. Solid lines: period effects are eliminated.


What is plotted is the variance multipliers for the contrasts that could be calculated at the given time period. (In other words, results that lie in the future compared to that time period could not be used.) This is the situation that any data safety monitoring board would be faced with in 'real time'. (Note that because similar numbers are allocated at similar times to Arm 3 and Arm 4, their values are very similar.)
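Such variance multipliers can be computed from the design matrix alone. The sketch below uses a small hypothetical two-period allocation (not the design of Figure 1): a linear model is set up with treatment dummies, with or without period as a categorical factor, and the multiplier for a treatment-versus-control contrast is read off the relevant diagonal element of the inverse cross-product matrix.

```python
import numpy as np

def variance_multiplier(alloc, arm, adjust_for_period):
    """Variance multiplier c'(X'X)^{-1}c for the contrast arm - Control ("C").

    alloc maps (period, arm) to the number of patients allocated."""
    periods = sorted({p for p, _ in alloc})
    arms = sorted({a for _, a in alloc if a != "C"})  # Control is the reference
    X = []
    for (p, a), n in alloc.items():
        row = ([1.0]                                          # intercept
               + [1.0 if a == t else 0.0 for t in arms])      # treatment dummies
        if adjust_for_period:
            row += [1.0 if p == q else 0.0 for q in periods[1:]]  # period dummies
        X.extend([row] * n)
    X = np.array(X)
    XtX_inv = np.linalg.inv(X.T @ X)
    j = 1 + arms.index(arm)  # with Control as reference, the arm's coefficient
    return XtX_inv[j, j]     # *is* the arm - Control contrast

# Hypothetical allocation: Arm A runs in both periods, Arm B joins in period 2
alloc = {(1, "C"): 50, (1, "A"): 50,
         (2, "C"): 33, (2, "A"): 33, (2, "B"): 33}

vm_ignore = variance_multiplier(alloc, "B", adjust_for_period=False)
vm_adjust = variance_multiplier(alloc, "B", adjust_for_period=True)
print(vm_ignore, vm_adjust)  # adjusting for period increases the multiplier
```

In this toy example the unadjusted multiplier for Arm B is simply 1/33 + 1/83, since ignoring period pools all 83 control patients, while the period-adjusted multiplier is larger because the Arm B dummy is partially confounded with the period-2 indicator, which is the pattern the solid versus dashed lines in Figure 2 display.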

If the figure is studied, two important features can be noted. 1) The variance multipliers adjusting for period are generally higher than those not adjusting. 2) The variances reduce over time. An exception to 1) is the variance multiplier for Arm 1 in the first 4 periods. This is because the design is such that the information for Arm 1 is balanced with respect to Control. This is not the case for other treatments and it is not the case even for Arm 1 for later periods. In fact, even though, from period 5 onwards, Arm 1 is no longer studied, the variance multiplier continues to reduce because other information, in particular as regards Control, continues to accrue.

Blind faith

As some of the authors pointed out in an earlier paper in the same journal[4]:

In a platform trial with many experimental agents it may be difficult or impossible to blind patients to every possible arm. They may have different modes of administration or dosing in such ways that blinding to all arms becomes incredibly difficult and burdensome. It is not uncommon that patients in platform trials are unblinded to which possible treatment arm they receive, but remain blinded to whether they receive active or placebo of that treatment.

p. 365

Some years ago, I referred to trials in which patients do not know which treatment they are receiving but do know of at least one treatment in the trial they are not receiving as veiled[5]. Consider a placebo-controlled trial involving two doses of a hormone replacement therapy to be delivered transdermally using adhesive patches. Unless every patient is given two patches, any patient will know from the size of the patch either that they are not being given the lower dose or that they are not being given the higher dose. Now, suppose that there are two possible placebo patches that can be used: a smaller one as a placebo to the lower dose and a larger one as a placebo to the higher dose. There will then be four treatment groups: active lower dose, placebo to the lower dose, active higher dose and placebo to the higher dose. The trial is then veiled.

This has implications for analysis. For example, to compare the higher and lower doses in such a design in a way that would be credibly blinded, you need first to compare each dose to its corresponding placebo and then compare the differences to each other, even though the two doses are given concurrently.
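The cost of this double-difference analysis is easy to quantify. A minimal sketch, assuming equal (and hypothetical) group sizes and independent group means:

```python
# Veiled analysis: high vs low dose estimated as a difference of differences,
# each dose first compared with its own placebo. Group sizes are hypothetical.

n = 50                        # patients per group (hypothetical, equal for simplicity)

vm_high = 1 / n + 1 / n       # high dose - its own placebo
vm_low = 1 / n + 1 / n        # low dose - its own placebo
vm_veiled = vm_high + vm_low  # (high - placebo_h) - (low - placebo_l): four means

vm_naive = 1 / n + 1 / n      # naive direct contrast high - low: two means

print(vm_veiled / vm_naive)   # -> 2.0
```

Respecting the veiling doubles the variance of the dose comparison relative to the naive direct contrast, because the two placebo groups must be carried along in the estimate.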

However, Saville et al claim[3]:

Unlike typical historical controls or real-world evidence, these ‘‘contemporary’’ controls are enrolled with the same protocol, the same inclusion/exclusion criteria, and the same data elements. The only difference is time.

p. 491

This is not true. If blinding matters, then this creates further groups that cannot be categorised in terms of time period only. For example, in the design shown in Figure 1, in all periods from period 3 onwards at least two arms are being studied in addition to Control. If there is to be any attempt at blinding, each active treatment requires its own control. In total we end up with not 10 period categories but 24 placebo-and-time categories, and the resulting variance multipliers, if we respect the veiled design, will be much higher than those given in Figure 2.

A further problem, which I shall not discuss in detail here, is that it is not unusual for additional centres to be added as a trial progresses and sometimes for others to leave. If the allocation rule is unchanged, concurrent control is not compromised. But if the rule is changed, then this has complex consequences.

O Brave New World, That has such Bayesians in't

I want to make it clear that I am not criticising Bayesian statistics per se, nor even the FDA's newfound enthusiasm for it. Nor am I claiming that there are no Bayesians who understand the points I have made. Nevertheless, some of the hype attending adaptive designs is not only ludicrous but dangerous. The principal values of adaptive designs are administrative efficiency and the ability to react swiftly to abandon unpromising treatments. Any claims beyond this should be viewed with suspicion[6].

References

  1. U.S. Food and Drug Administration. Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products: Guidance for Industry. Center for Drug Evaluation and Research (CDER) and Center for Biologics Evaluation and Research (CBER). Rockville, MD, 2026, p. 25.
  2. Senn SJ. Fisher's game with the devil. Statistics in Medicine 1994; 13: 217-230.
  3. Saville BR, Berry DA, Berry NS, et al. The Bayesian Time Machine: Accounting for temporal drift in multi-arm platform trials. Clinical Trials 2022; 19: 490-501. DOI: 10.1177/17407745221112013.
  4. Saville BR and Berry SM. Efficiencies of platform clinical trials: A vision of the future. Clinical Trials 2016; 13: 358-366. DOI: 10.1177/1740774515626362.
  5. Senn SJ. A personal view of some controversies in allocating treatment to patients in clinical trials. Statistics in Medicine 1995; 14: 2661-2674.
  6. Senn SJ. Being Efficient About Efficacy Estimation. Statistics in Biopharmaceutical Research 2013; 5: 204-210. DOI: 10.1080/19466315.2012.754726.

The new (draft) guidance is almost devoid of adaptive designs. It consistently refers to the 2019 FDA guidance on adaptive designs (link below). I think your discussion of concurrent controls applies to frequentist and Bayesian approaches alike. I do agree that many of the superlatives in press releases and first takes on the new guidance are not in sync with its actual content. As I see it, the "breakthrough" is mostly that industry now has a clearer view of how the FDA will evaluate applications of Bayesian statistical methods. This by itself will probably lead to wider adoption. Link to the 2019 guidance on adaptive designs: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adaptive-design-clinical-trials-drugs-and-biologics-guidance-industry


Stephen, as always, this is a thoughtful and characteristically sharp perspective. I’ve always appreciated how you push the field to separate statistical enthusiasm from statistical rigor. As someone who is primarily a frequentist, I admit that when collaborators ask me about Bayesian approaches, my first instinct is usually skepticism and caution. Not because the methods lack merit, but because the assumptions, particularly around borrowing information across time or populations, can become quite consequential if they’re not carefully justified. I agree that the real issue is less Bayes vs frequentist and more about preserving the integrity of trial design, especially concurrent control and protection against time trends in adaptive or platform trials. The recent guidance from the U.S. Food and Drug Administration is an important signal that regulators are open to modern statistical tools. But methodological flexibility should never be mistaken for immunity from bias. And Stephen… I may already be heading in the “bad fairy” direction myself — I just didn’t know what to call it until now. 🧚♀️

Thanks for this Important discussion. A productive way forward is not to frame this as Bayesian versus frequentist, but to structure how both are used. In forthcoming work in Statistics in Medicine, we introduce CARE (Clarify, Apply, Refine, Evaluate) for cluster trials. The approach anchors inference in a design-based, cluster-robust benchmark that remains valid under heterogeneity and imbalance, and then allows assumption-rich models—including Bayesian ones—as transparent refinements rather than defaults. The sequencing is deliberate: start with what the data support under minimal assumptions, then layer additional structure only when it is justified and computationally stable. That avoids false confidence from fragile covariance assumptions while still permitting efficient Bayesian learning where appropriate. In short, robustness and efficiency do not have to be competing camps—there is a disciplined way to do both. Preprint: https://www.researchgate.net/publication/376204429_Cluster_trials_inference_with_CARE

Thank you for the article. As per usual, it's the hype that is the danger. You said it best: "[A]dministrative efficiency": 🍾🎉 Rigorous biostatistician: 😱


I try to stay out of these as it’s outside my wheelhouse, but I’ve been reading content both on here and in more sophisticated frameworks and it concerns me. There are two hundred and fifty years of successes and failures, and I am concerned that people may try to learn new ways to do things the wrong way rather than build a framework of formal tradeoffs. My concern is that it is happening in a less formal way than is advised for such an important process. I am not concerned by the use of Bayes. I am concerned that the rule-making process is less than what should be used. I argue in my field that everything except subjective Bayes is illegal to use. And everything except subjective Bayes should be illegal to use. I have no problem with Bayes. I have a problem with shortcuts which produce bad results. I am less than certain that the framework is enough for the problem. But I don’t know the people. I am sure that they are serious and careful people. I am used to working with less than careful people. My concerns may be mooted by the skill of the people. As I said, it’s not my wheelhouse.

