Type-M and Type-S errors in underpowered studies

In clinical trials we put a lot of effort into designing studies so that they are adequately "powered". Why do we do this? Well, there are two reasons: one obvious, and one less commonly known, but definitely no less important!

  1. To be able to detect (discern) what we want to detect. That's the obvious one :)
  2. To avoid being fooled by statistically discernible (= statistically significant) outcomes that are very far from reality (exaggerated magnitude, flipped sign).

Otherwise, Type-M and Type-S errors can simply ruin our studies! The sad reality, however, is that randomised clinical trials make up only a small fraction of all research...

Many researchers simply cannot afford to recruit enough participants (e.g., patients) or do not have access to sufficiently rich data. So even if they correctly calculate the minimum necessary sample size, it may be completely out of their reach... Thus, many studies may be more or less underpowered, and the results may be far from the truth...

But what can cause such damage? Well, it comes down to just two basic facts that everyone finds "trivial and obvious", but which have serious consequences! Let's start from the beginning. Imagine that:

  1. there was a negligible actual effect in the population,
  2. your study was severely underpowered,
  3. you were quite unlucky: the sampling (= data acquisition in the study) gave you data from the tails of the (virtual) population distributions. Values in the tails are more extreme than values in the middle of the distribution, and if extreme enough, they can easily produce samples whose central tendencies (means, medians) lie far apart. A quick sketch of this follows the list.
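
To make point 3 concrete, here is a minimal sketch (my own illustration, not code from the original post): with a negligible true effect (Δ=1, SD=10) and N=10 per group, repeated sampling readily produces pairs of samples whose means lie far apart.

# How far apart can two sample means drift when the true Δ is only 1?
# (hypothetical sketch; 2000 repeated draws, N = 10 per group, SD = 10)
set.seed(7)
gaps <- replicate(2000, {
  x <- rnorm(10, mean = 1, sd = 10)   # "treatment" group, true mean = 1
  y <- rnorm(10, mean = 0, sd = 10)   # "control" group, true mean = 0
  mean(x) - mean(y)
})
max(abs(gaps))   # unlucky draws land well above 10 units despite the true Δ = 1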


Can you feel the disaster approaching?

  1. The arithmetic mean is an additive measure (recall the summation sign in its formula?), therefore it is sensitive to large observations (1+2+3 < 1+2+8 ≪ 1+2+20).
  2. In small samples the contribution (the impact) of each data point is bigger than in large ones. And the more such extreme observations land in a small sample (remember, we sampled from the vicinity of the tails), the bigger the impact, especially in really small samples like N=10-30! A numeric illustration follows this list.
  3. Now double this problem. Why? Because a comparison requires TWO samples, and BOTH of them can be harmed in exactly this way (why not? all samplings are equally probable, unless you modify the sampling somehow, and then you can make things even worse...)!
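
A quick numeric illustration of points 1 and 2 (my own sketch): a single extreme value shifts a small-sample mean by whole units, while a large sample barely notices it.

# One extreme value (50) added to a small vs. a large sample:
x_small <- c(1, 2, 3, 4, 5)            # N = 5, mean = 3
x_large <- rep(1:5, 20)                # N = 100, mean = 3
mean(c(x_small, 50)) - mean(x_small)   # the mean shifts by ~7.8 units
mean(c(x_large, 50)) - mean(x_large)   # the mean shifts by ~0.47 units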

Let's have a look at this exemplary sampling:

[Figure: an exemplary sampling]

With these two samples we clearly face a Type-M error (M for Magnitude). Its size is expressed by the "exaggeration" (or "inflation") factor, which tells you by how many times, when repeating such a study, your result may exaggerate the real effect, if found statistically discernible (= significant).


Things can get even worse: the sign may flip, so your finding may point in the direction opposite to the actual one. This is the Type-S error (S for Sign).


Now imagine that we drew thousands of 10-observation samples from two Gaussian distributions sharing the same dispersion (SD=10) and with means differing by Δ=1. Then we compared these samples using a t-test at a significance level of α=0.05.
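
Before running the numbers, here is how I read the arguments of the retrodesign() call below (my interpretation, following Gelman & Carlin's notation, not an official description of the package):

# A  - the true population effect, here Δ = 1
# s  - the standard error of the difference in means: SD * sqrt(1/n1 + 1/n2)
# df - the degrees of freedom of the pooled two-sample t-test: n1 + n2 - 2
10 * sqrt(1/10 + 1/10)   # s ≈ 4.47 for SD = 10 and n = 10 per group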

> retrodesign::retrodesign(A=1, s = 10*sqrt(2/10), df = 20-2, n.sims = 10000)
$power
[1] 0.05516129

$type_s
[1] 0.2705776

$type_m
[1] 11.20466        

The obtained power was ≈5.5%, the exaggeration factor reached 11, and the risk of a sign flip among statistically significant results was elevated to 27%! Let's plot the results of 10,000 simulations and look at the numbers. (BTW, I modified the code of the sim_plot() function, so now it prints the values of some necessary "internals".)

> library(ggplot2)   # sim_plot() returns a ggplot object we can extend
> set.seed(1000)
> retrodesign::sim_plot(A = 1, s = 10*sqrt(2/10), df = 18, n.sims = 10000) +
+   theme_bw() +
+   geom_hline(yintercept = 1, col = "red", linewidth = 1, linetype = "dashed")

[1] "Positive: 392, Negative: 176, Mean (abs diff.): 11.7"        

Of the 568 statistically significant differences, 392 were positive and 176 were negative. Note that with the true Δ=1 the null hypothesis is actually false, so these are not Type-1 errors: the empirical rejection rate of 5.68% is simply the achieved power, closely matching the ≈5.5% computed above and barely exceeding the α=0.05 level, as expected with such a small data size.

The worrying thing was that these statistically significant findings were greatly exaggerated (depicted as ■), reaching 20 units in both directions (recall that the true Δ=1)! There were also 176 differences with the opposite sign (▲), making ≈31%, which reasonably agrees with the anticipated fraction (27%). The mean of the absolute values of all statistically significant differences was 11.7, which, divided by the true effect Δ=1, closely matches the expected exaggeration factor (11.2).

The figure below illustrates the result of our simulation. Now imagine that each ■ and ▲ could be your case... With just a small difference of Δ=1 unit in the population, we easily obtained samples with observed differences above 20!

[Figure: the 10,000 simulated differences; ■ = significant, exaggerated; ▲ = significant, flipped sign]
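
For readers who want to verify these numbers without the package, here is a from-scratch cross-check (my own sketch, not code from the post): simulate the same design with plain Welch t-tests and summarise the statistically significant results.

# 10,000 simulated studies: N = 10 per group, true Δ = 1, SD = 10, α = 0.05
set.seed(1000)
n_sims <- 10000
diffs <- replicate(n_sims, {
  x <- rnorm(10, mean = 1, sd = 10)   # treatment group
  y <- rnorm(10, mean = 0, sd = 10)   # control group
  ifelse(t.test(x, y)$p.value < 0.05, mean(x) - mean(y), NA)
})
sig <- diffs[!is.na(diffs)]   # keep only the significant differences
length(sig) / n_sims          # empirical power, ≈ 0.055
mean(sig < 0)                 # Type-S: fraction of sign flips, ≈ 0.3
mean(abs(sig))                # Type-M: mean exaggeration of Δ = 1, ≈ 11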

An important remark: power is (non-linearly) related to the Type-M error. It can be shown that already below the traditional 80% threshold the exaggeration factor is noticeably larger than 1 - not by much, but still. As the power drops further, the exaggeration starts growing quickly, and as the power approaches the significance level α (here ≈5%), both Type-M and Type-S errors grow dramatically!

> retrodesign::retrodesign(A=0.5, s = 10*sqrt(1/10 + 1/10), df = 20-2, n.sims = 1000)
$power
[1] 0.0512874

$type_s
[1] 0.3785357  # <-- ~40%!

$type_m
[1] 21.40283   # <-- 21x inflation!        
[Figure: Type-M error as a function of power; made with the retrodesign R package]
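
A curve like the one above can be reproduced with a small loop (my own sketch): keep the true effect fixed, vary the standard error to sweep the power range, and record the exaggeration factor returned by retrodesign().

# Sweep the power range by varying the standard error s (true effect A = 1)
library(retrodesign)
s_grid <- seq(0.3, 6, by = 0.1)
res <- t(sapply(s_grid, function(s) {
  r <- retrodesign(A = 1, s = s, df = 18)
  c(power = r$power, type_m = r$type_m)
}))
plot(res[, "power"], res[, "type_m"], type = "l", log = "y",
     xlab = "power", ylab = "Type-M (exaggeration) factor")
abline(v = 0.8, lty = 2)   # the traditional 80% power threshold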

This explains why we use the traditional threshold of 80% for power in clinical trials.

So, the problem is disregarding statistical power. Due to low power, your study may be either futile or give you just garbage. The power can be compromised by:

  1. no power analysis (sample size calculation)
  2. ignoring possible dropouts (with no plan for handling missing data, no "what-if" analysis)
  3. specifying an incorrect effect size to be detected in the sample size formula, which is especially important in studies with practical importance embedded (MCID - the minimal clinically important difference), like non-inferiority and clinical superiority with MCID > 0.
  4. being GREEDY in multiplying research questions. Focus on the 1-3 most important ones, don't test everything!
  5. applying unnecessarily conservative adjustments for multiple testing, like Bonferroni (we have MUCH better ways to control both the FWER and the FDR; see the sketch after this list)
  6. using small pilot studies to "feed" the power analysis (their effect estimates are noisy for exactly the same reasons!)
  7. unnecessarily switching to non-parametric, rank-based methods and other methods that alter the null hypothesis.
  8. disregarding assumptions of used statistical methods, using suboptimal parameters (covariance structures, etc.)
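
Regarding point 5, here is a minimal base-R sketch: Holm controls the FWER like Bonferroni while never being less powerful, and Benjamini-Hochberg controls the FDR instead (the p-values below are purely hypothetical).

# Classical adjustments compared on hypothetical p-values:
p <- c(0.001, 0.008, 0.020, 0.041, 0.120)
p.adjust(p, method = "bonferroni")   # the most conservative option
p.adjust(p, method = "holm")         # same FWER control, never less powerful
p.adjust(p, method = "BH")           # controls the false discovery rate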

Some might say now:

With the Bayesian approach we can do better! Just use a certain prior distribution to fix these problems.

Yes, provided you have the necessary knowledge. The missing information cannot be conjured out of nowhere. Both frequentist and Bayesian methods see exactly the same data. If the information cannot be obtained from the data or from reliable sources (but hey, if it could, the problem would not arise!), it must come from assumptions - because you will have to somehow define your prior distribution parameters.

If this knowledge is available, both Bayesians and frequentists have their tools and can address the problem appropriately. And if it is unavailable, then priors are nothing but guessing. No regularising priors, shrinkage estimators, etc. will help there. In other words, no statistical "hocus-pocus" will work if you lack the domain knowledge.

Some also may say:

Don't use p-values, use confidence intervals!

First of all, in RCTs p-values are just as good and valuable measures as confidence intervals. But even if they were not, confidence intervals may do no better in this case. You will still obtain the wrong magnitude (and maybe also the wrong sign), and the interval may lie far from 0. The only thing that raises concern here is its length, a potential indicator of an underpowered study - but that's all... See?

The 95% CI = [5.7; 31.3] is "oddly" long (we know why!) but, at the same time, lies quite far from the null value (zero). So it should definitely trigger suspicion, but one could easily claim that the effect is "discernible enough from zero", especially at such a small data size...

> t.test(control_points, treatment_points)

	Welch Two Sample t-test

data:  control_points and treatment_points
t = 3.0334, df = 17.999, p-value = 0.007147
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  5.701113 31.392050
sample estimates:
mean of x mean of y 
109.65957  91.11299         
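
As a side note (my own sketch, based only on the printed output above): one can back out the standard error from the t statistic and feed it to retrodesign() to see what such a "significant" result would imply if the true effect were as small as the Δ=1 from our simulation.

# Back-calculation from the Welch output above (hypothetical exercise):
est <- 109.65957 - 91.11299   # observed difference ≈ 18.55
se  <- est / 3.0334           # SE recovered from t = est / se, ≈ 6.12
retrodesign::retrodesign(A = 1, s = se, df = 18)   # tiny power, huge Type-M/S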

In my post I used the awesome retrodesign R package to do the simulations. I encourage you to play with it.


So, every time you hear something like this: "Adrian, that's amazing! We achieved great results with just 10 observations! We'll publish it immediately if it passes review!", just tell them:

"OK, well, there are 2 options:

1) either your sampling came ideally from the middle of both compared distributions, so the population effect you assess through these samples is real, and you've got a big deal,

2) or you're looking at an exaggeration of a small (or even zero - then it's a Type-1 error) true effect: you were unlucky enough to sample from certain halves of the compared distributions, and this was amplified many times by the fact that the fewer observations you have, the greater the impact of each one."

...and make them think about their situation.


In my post I introduced Type-M and Type-S errors to teach about the threats resulting from underpowered studies. But if you want to employ this idea also to design new studies or to evaluate already conducted ones, please pay attention to this paper:

Lakens, D., Mesquida, C., Xavier-Quintais, G., Rasti, S., Toffalini, E., & Altoè, G. (2025, August 5). Rethinking Type S and M Errors. https://doi.org/10.31234/osf.io/2phzb_v1, freely available at https://osf.io/preprints/psyarxiv/2phzb_v1.

Abstract:

Gelman and Carlin (2014) introduced Type S (sign) and Type M (magnitude) errors to highlight the possibility that statistically significant results in published articles are misleading. While these concepts have been proposed to be useful both when designing a study (prospective) and when evaluating results (retroactive), we argue that these statistics do not facilitate the proper design of studies, nor the meaningful interpretation of results. Type S errors are a response to the criticism of testing against a point null of exactly zero in contexts where true zero effects are implausible. Testing against a minimum-effect, while controlling the Type 1 error rate, provides a more coherent and practically useful alternative. Type M errors warn against effect size inflation after selectively reporting significant results, but we argue that statistical indices such as the critical effect size or bias adjusted effect size are preferable approaches. We do believe that Type S and M errors can be valuable in statistics education where the principles of error control are explained, and in the discussion section of studies that fail to follow good research practices. Overall, we argue their use-cases are more limited than is currently recognized, and alternative solutions deserve greater attention.

📚 Here's a list of valuable links to blogs, papers and simulations:

  1. Mircea Zloteanu: What NOT to do with NON-“null” results Part III: Underpowered study, but significant results (I recommend reading the whole part 1-3 series!)
  2. Kenneth Tay, What is the exaggeration ratio (expected Type M error)?
  3. Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641-651. https://doi.org/10.1177/1745691614551642
  4. Clay Ford, Assessing Type S and Type M Errors
  5. Giulia Bertoldo, Claudio Zandonella Callegher, Gianmarco Altoè, Designing Studies and Evaluating Research Results: Type M and Type S Errors for Pearson Correlation Coefficient
  6. Lu, Jiannan, Qiu, Yixuan, & Deng, Alex (2018). A note on Type S/M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology, 72, 1-17. doi:10.1111/bmsp.12132
  7. R package retrodesign. It provides tools for working with Type S (Sign) and Type M (Magnitude) errors, as proposed in Gelman and Tuerlinckx (2000) <doi:10.1007/s001800000040> and Gelman & Carlin (2014) <doi:10.1177/1745691614551642>
  8. Excellent simulator by Lukas Vermeer (GitHub sources)


We haven't considered the challenge of selecting the effect size to detect in this post, so let me just mention these important papers (all should be downloadable directly from the provided links):

  1. Chuang-Stein, C., Kirby, S., Hirsch, I. and Atkinson, G. (2011), The role of the minimum clinically important difference and its impact on designing a trial. Pharmaceut. Statist., 10: 250-256. https://doi.org/10.1002/pst.459 - I found this important paper available at Wayback Machine: http://static.sdu.dk/mediafiles/C/A/1/%7BCA1F51E7-FD7E-441F-9769-11D77B990F08%7DChuang-Stein-Atlinson2011PharmStat.pdf?origin=publication_detail
  2. Cook J A, Julious S A, Sones W, Hampson L V, Hewitt C, Berlin J A et al. DELTA2 guidance on choosing the target difference and undertaking and reporting the sample size calculation for a randomised controlled trial BMJ 2018; 363 :k3750 doi:10.1136/bmj.k3750, https://www.bmj.com/content/363/bmj.k3750
  3. Salas Apaza JA, Franco JVA, Meza N, Madrid E, Loézar C, Garegnani L. Minimal clinically important difference: The basics. Medwave. 2021;21(3):e8149. doi: 10.5867/medwave.2021.03.8149, https://www.medwave.cl/medios/medwave/Mayo2021/PDF/medwave-2021-03-e8149.pdf
  4. Wong, H. Minimum important difference is minimally important in sample size calculations. Trials 24, 34 (2023). https://doi.org/10.1186/s13063-023-07092-8, https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-023-07092-8
  5. Westlund, E., & Stuart, E. A. (2017). The nonuse, misuse, and proper use of pilot studies in experimental evaluation research. American Journal of Evaluation, 38(2), 246–261. https://doi.org/10.1177/1098214016651489, https://files.eric.ed.gov/fulltext/EJ1141165.pdf
  6. Vickers, A., Nolla, K., & Cella, D. (2025). Drop the “M”: Minimally Important Difference and Change Are Not Independent Properties of an Instrument and Cannot Be Determined as a Single Value Using Statistical Methods. Value in Health, 28(6), 894–897, https://www.sciencedirect.com/science/article/abs/pii/S1098301525004164
  7. Angst F, Aeschlimann A, Angst J. The minimal clinically important difference raised the significance of outcome effects above the statistical level, with methodological implications for future studies. J Clin Epidemiol. 2017;82:128-136, https://www.sciencedirect.com/science/article/pii/S0895435616307764
  8. Man-Son-Hing, M., Laupacis, A., O'Rourke, K., Molnar, F. J., Mahon, J., Chan, K. B., & Wells, G. (2002). Determination of the clinical importance of study results. Journal of General Internal Medicine, 17(6), 469-476, https://scispace.com/pdf/determination-of-the-clinical-importance-of-study-results-1las9461wz.pdf
  9. De Vet, H.C.W., & Terwee, C.B. (2010). The minimal detectable change should not replace the minimal important difference. Journal of Clinical Epidemiology, 63(7), 804–805; author reply 806, https://www.jclinepi.com/article/S0895-4356(10)00031-4/fulltext

