Randomization Procedures


  • Sara Weston, PhD

    Data Scientist who designs experiments and fixes broken metrics | Causal Inference | 50+ publications, 1 federal policy change | R, SQL

    There's a statement about A/B testing that gets repeated so confidently that nobody pushes back on it. It's not wrong, exactly. It's just not as true as people think.

    "Random assignment ensures the groups are equivalent."

    This gets said in every A/B testing primer, every experimentation course, every stakeholder meeting where someone asks "but how do we know the groups are comparable?", and some of the comments on my LinkedIn posts. And it's... almost true.

    Random assignment ensures the groups are equivalent in expectation, meaning that if you repeated the randomization thousands of times, the average difference between groups on any variable would be zero. In stats language, we would say, "There's no systematic bias." Practically, we can be confident that nobody is cherry-picking who gets treatment.

    But you don't run thousands of randomizations. You only run one. And one randomization is one draw from that distribution. Big sample? The draw is almost certainly fine. Small sample? You can get groups that look nothing alike, and the randomization didn't fail. That's just how probability works at small N.

    I ran a simulation. Take 40 people with a known covariate — say, prior engagement score — and randomly split them 20/20. Do it 500 times. Some splits are nearly perfect. Others are off by more than half a standard deviation. Every single one of those is "correctly randomized." Some of them will absolutely give you misleading results if you don't deal with it. Do the same thing at 1,000 per group and the distribution of imbalances basically disappears. That's the Law of Large Numbers doing its thing. But n=20 isn't large, so you can't count on the LLN to save you.

    So what do you actually do when your experiment isn't huge?

    𝗖𝗵𝗲𝗰𝗸 𝗯𝗮𝗹𝗮𝗻𝗰𝗲 𝗯𝗲𝗳𝗼𝗿𝗲 𝘆𝗼𝘂 𝗹𝗼𝗼𝗸 𝗮𝘁 𝗼𝘂𝘁𝗰𝗼𝗺𝗲𝘀. Compare the groups on everything you measured pre-treatment. Age, tenure, prior usage, whatever. If something is meaningfully off, you need to know before you interpret results.

    𝗦𝘁𝗿𝗮𝘁𝗶𝗳𝘆 𝗼𝗿 𝗯𝗹𝗼𝗰𝗸 𝘁𝗵𝗲 𝗿𝗮𝗻𝗱𝗼𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻. If you know certain variables matter, force balance on them upfront. Randomize within strata.

    𝗔𝗱𝗷𝘂𝘀𝘁 𝗳𝗼𝗿 𝗯𝗮𝘀𝗲𝗹𝗶𝗻𝗲 𝗰𝗼𝘃𝗮𝗿𝗶𝗮𝘁𝗲𝘀 𝗶𝗻 𝘁𝗵𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀. Be smart about using information you already have; it almost always gives you tighter estimates. (But make sure you're not adjusting for colliders or post-treatment variables!)

    And if you ran a small experiment without doing any of this — just be honest about it. The results might be fine. But you're trusting luck more than you think.

    Randomization solves the systematic bias problem. It doesn't solve the bad luck problem. Those are different things, and small experiments are exactly where the difference shows up.
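    A minimal sketch of this kind of simulation (illustrative numbers and variable names, assuming a roughly standardized prior-engagement covariate rather than anything from the original code):

```python
import numpy as np

rng = np.random.default_rng(42)

def imbalance_quantiles(n_per_group, n_sims=500):
    """Fix one sample with a pre-treatment covariate, re-run the 50/50
    randomization n_sims times, and record the group difference each time."""
    covariate = rng.normal(0, 1, size=2 * n_per_group)  # e.g. prior engagement score, in SD units
    diffs = []
    for _ in range(n_sims):
        idx = rng.permutation(2 * n_per_group)           # one perfectly valid randomization
        treat = covariate[idx[:n_per_group]]
        control = covariate[idx[n_per_group:]]
        diffs.append(abs(treat.mean() - control.mean()))
    return np.percentile(diffs, [50, 95])

print("20 per group   (median, 95th pct imbalance):", imbalance_quantiles(20))
print("1000 per group (median, 95th pct imbalance):", imbalance_quantiles(1000))
```

    In this sketch the worst n = 20 splits land well past half a standard deviation apart, while at n = 1,000 even the 95th percentile stays under a tenth of one.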

  • Sean Taylor

    Model Measurement at OpenAI

    Very excited to share a new paper that has been a long time in the making. This has been a fun collaboration with my co-authors Ruoxuan Xiong (Emory) and Alex Chin (my co-worker at Lyft and now Motif Analytics).

    Randomized experiments are the gold standard for measuring causal effects, but in marketplaces we are often testing policies with so many plausible spillovers that it is difficult to learn what we need by assigning treatment across users. Instead we randomize over time. This type of experiment seems simple to design: you are implementing a square wave (a type of oscillator) that determines which policy you are running based on time.

    When I was at Lyft, we had some heuristics for choosing switchback parameters but we rarely had bandwidth to understand their impact. It turns out to be a rich design space, and by choosing how and when you switch policies, you control the bias and variance of the estimates from your experiment. Intuitively, faster switching yields lower variance by increasing your sample size, but it increases bias because effects tend to persist over time (carryover effects). Your measurements from each time period are also correlated and have heteroskedastic errors due to seasonality (marketplaces tend to have strong daily and weekly cycles).

    Our approach is effectively a model-based design process where we use historical data to estimate the inputs to the experimental design. The data allow us to make informed decisions about switching behavior that will yield the lowest error in our estimates. Carryover effects are the hardest quantity to estimate from historical data because on any individual test they are quite noisy, so pooling is necessary to gain additional precision. We analyze a corpus of hundreds of switchback tests from Lyft's marketplace and cluster them into an interpretable distribution over impulse responses.

    A broader point of this research is that all experimental designs lean on prior knowledge to improve the chances of a successful experiment -- even choosing a sample size for desired power in a standard A/B test. In switchback tests, there is an important bias-variance tradeoff we must manage. Without some means to estimate the covariance of errors and the likely size and shape of carryover effects, it is difficult to design an experiment that is likely to be successful.
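    A toy sketch of the basic design object being discussed, a randomized switchback schedule, with made-up interval lengths (this is only the assignment mechanism, not the paper's estimation or design-optimization procedure):

```python
import numpy as np

rng = np.random.default_rng(7)

def switchback_schedule(n_hours, interval_hours):
    """Split the horizon into blocks of `interval_hours` and randomly assign
    each block to policy A (0) or B (1): a randomized square wave over time."""
    n_blocks = -(-n_hours // interval_hours)              # ceiling division
    block_policy = rng.integers(0, 2, size=n_blocks)
    return np.repeat(block_policy, interval_hours)[:n_hours]

# Shorter intervals mean more switches and more effective replicates (lower
# variance), but every period just after a switch still carries the previous
# policy's lingering effect (carryover), so bias grows as intervals shrink.
for hours in (12, 4, 1):
    print(f"{hours:>2}h intervals:", switchback_schedule(n_hours=24, interval_hours=hours))
```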

  • Cliff Eala

    Behavioural Strategist | Technologist | Author

    I’m working on 5 Behavioural Science experiments across 𝟱𝟳𝟬𝗸+ 𝗽𝗲𝗼𝗽𝗹𝗲 today.

    An experiment, specifically a 𝗿𝗮𝗻𝗱𝗼𝗺𝗶𝘀𝗲𝗱 𝗰𝗼𝗻𝘁𝗿𝗼𝗹𝗹𝗲𝗱 𝗲𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 (𝗥𝗖𝗘), is the gold standard for testing whether a drug 💊 or a vaccine 💉 works. Scientists use randomised controlled trials to 𝗶𝗻𝗳𝗲𝗿 whether a drug is 𝗰𝗮𝘂𝘀𝗶𝗻𝗴 𝘁𝗵𝗲 𝗵𝗲𝗮𝗹𝗶𝗻𝗴 𝗲𝗳𝗳𝗲𝗰𝘁 for which they designed it.

    In my work as a behavioural strategist, I always prefer to test an intervention (aka treatment) 𝘁𝗼 𝗲𝘀𝘁𝗮𝗯𝗹𝗶𝘀𝗵 𝗰𝗮𝘂𝘀𝗮𝗹 𝗲𝘃𝗶𝗱𝗲𝗻𝗰𝗲 𝗯𝗲𝘁𝘄𝗲𝗲𝗻 𝘁𝗵𝗲 𝗶𝗻𝘁𝗲𝗿𝘃𝗲𝗻𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝘁𝗵𝗲 𝘁𝗮𝗿𝗴𝗲𝘁 𝗰𝗵𝗮𝗻𝗴𝗲 𝗶𝗻 𝗯𝗲𝗵𝗮𝘃𝗶𝗼𝘂𝗿. For instance, if I’m adjusting the buying journey, I’d like to test whether the adjustment is causing the shift in buying behaviour before I throw funds into scaling the adjustment. 🤔 I also want to avoid 𝗳𝗮𝗹𝘀𝗲 𝗽𝗼𝘀𝗶𝘁𝗶𝘃𝗲𝘀 ❌, i.e., concluding that the adjustment works when I just got lucky.

    Many loosely refer to RCEs as an ‘A/B test’ (or ‘A/B/n test’ if there is more than one intervention to test). I’m careful about using those terms because many of these tests have sloppily disregarded an RCE cornerstone – 𝗿𝗮𝗻𝗱𝗼𝗺𝗶𝘀𝗮𝘁𝗶𝗼𝗻. Why is randomisation important? Because without randomisation, we end up 𝗯𝗶𝗮𝘀𝗶𝗻𝗴 𝗼𝘂𝗿 𝗿𝗲𝘀𝘂𝗹𝘁𝘀. 😮

    There are 𝟯 𝗽𝗼𝗶𝗻𝘁𝘀 in an RCE relevant to randomisation that I’d like to highlight: selection, allocation, and intervention delivery.

    1️⃣ & 2️⃣ 𝗦𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗔𝗹𝗹𝗼𝗰𝗮𝘁𝗶𝗼𝗻. Randomly select individuals from your population to create a representative sample for the experiment, and randomly allocate them to your treatment and control groups. Both the 𝘀𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗮𝗹𝗹𝗼𝗰𝗮𝘁𝗶𝗼𝗻 𝘀𝗵𝗼𝘂𝗹𝗱 𝗯𝗲 𝗿𝗮𝗻𝗱𝗼𝗺. Selecting and allocating by the first letter of the last name, by the order individuals are stored in a database, or by the city they live in aren’t random. Any time you follow a pattern, you throw randomisation out and bring bias in.

    3️⃣ 𝗜𝗻𝘁𝗲𝗿𝘃𝗲𝗻𝘁𝗶𝗼𝗻 𝗗𝗲𝗹𝗶𝘃𝗲𝗿𝘆. Ideally, you should deliver the intervention and control (baseline) treatments to participants at the same time. But sometimes this is not possible. For instance, if you have to broadcast a treatment message to 𝟱𝟬𝟬𝗸 𝗽𝗮𝗿𝘁𝗶𝗰𝗶𝗽𝗮𝗻𝘁𝘀 and your messaging system only allows you 𝟭𝟬𝗸 𝗺𝗲𝘀𝘀𝗮𝗴𝗲𝘀 𝗽𝗲𝗿 𝗵𝗼𝘂𝗿, you must randomly sequence your broadcast. You don’t want treatment group A to receive your message at 7 a.m. while treatment group F receives it at 7 p.m.; otherwise broadcast time biases your results (unless broadcast time is a treatment in itself).

    As Matteo Maria Galizzi, my mentor from The London School of Economics and Political Science (LSE), taught me: 𝗻𝗼 𝗮𝗺𝗼𝘂𝗻𝘁 𝗼𝗳 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝗰𝗲 𝗰𝗮𝗻 𝗳𝗶𝘅 𝗮 𝗳𝗮𝘂𝗹𝘁𝘆 𝗲𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁𝗮𝗹 𝗱𝗲𝘀𝗶𝗴𝗻.

    #behavioraleconomics #behavioralscience #behavioraldesign
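    A minimal sketch of points 1-3, with hypothetical participant IDs and an assumed 10k-per-hour sending limit (illustrative only, not the actual experimental setup):

```python
import random

random.seed(2024)

participants = [f"user_{i}" for i in range(500_000)]   # hypothetical participant IDs

# 1) & 2) Selection and allocation: shuffle once, then split into arms,
#         so neither membership nor group depends on any stored ordering.
random.shuffle(participants)
arms = {"control": participants[0::2], "treatment": participants[1::2]}

# 3) Intervention delivery: the gateway sends only 10k messages per hour,
#    so randomly interleave the arms in the send queue; that way send time
#    is not confounded with which arm a participant is in.
send_queue = [(user, arm) for arm, users in arms.items() for user in users]
random.shuffle(send_queue)
hourly_batches = [send_queue[i:i + 10_000] for i in range(0, len(send_queue), 10_000)]

first = hourly_batches[0]
print(len(hourly_batches), "batches; arms in first hour:",
      {a: sum(1 for _, arm in first if arm == a) for a in arms})
```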

  • Victor GUILLER

    Design of Experiments (DoE) Expert @L’Oréal | 💪 Empowering R&I Formulation labs with Data Science & Smart Experimentation | ⚫ Black Belt Lean Six Sigma | 🇫🇷 🇬🇧 🇩🇪

    "𝑩𝒍𝒐𝒄𝒌 𝒘𝒉𝒂𝒕 𝒚𝒐𝒖 𝒄𝒂𝒏, 𝒓𝒂𝒏𝒅𝒐𝒎𝒊𝒛𝒆 𝒘𝒉𝒂𝒕 𝒚𝒐𝒖 𝒄𝒂𝒏𝒏𝒐𝒕" (George E. P. Box)

    🔊 Noisy data is meaningless data. Improper procedures to subtract out the #noise in data can lead to a false sense of accuracy or to false conclusions. Fortunately, noise in experimental designs can be reduced effectively through the strategic use of randomization and blocking.

    🎲 If the nuisance factor is 𝐮𝐧𝐤𝐧𝐨𝐰𝐧 and 𝐮𝐧𝐜𝐨𝐧𝐭𝐫𝐨𝐥𝐥𝐚𝐛𝐥𝐞, use randomization. Randomization allocates treatments to experimental units at random to avoid any bias resulting from the influence of some extraneous, unknown factor that may affect the experiment. It is used to avoid confounding between treatment effects and other unknown effects (spurious correlations).

    🔢 Randomization also serves to achieve a homogeneous error distribution. By randomizing, any potential instrument drift or unknown bias is expected to "average out," contributing to a more uniform and unbiased experimental outcome. Usually a random number generator is used to allocate runs in the design.

    🔲 If the nuisance factor is 𝐤𝐧𝐨𝐰𝐧 and 𝐜𝐨𝐧𝐭𝐫𝐨𝐥𝐥𝐚𝐛𝐥𝐞, use blocking. Blocking is a restriction of complete randomization, where similar experimental units are grouped in blocks. By comparing treatments within the same block, block effects are eliminated from the comparison of treatment effects, minimizing both bias and variance. This is valuable for removing the impact of known nuisance factors that hold no interest for the experimenter. Within a block, run order should still be randomized.

    ⚖ The general rule for blocking is to construct blocks using high-order interaction columns. This introduces a trade-off, as the blocking effect becomes aliased with the high-order interaction effect; the approach rests on the assumption of negligible high-order interactions, which is often valid in practice.

    ⏳ Bonus: for experiments sensitive to aging or run order, you can include time as a covariate in the model to build designs robust to time trends. For more info, see chapter 9, "Experimental design in the presence of covariates," from "Optimal Design of Experiments: A Case Study Approach" by Peter Goos and Bradley Jones.

    📸 Photo: "The Blocking Principle" from the course "Experimental Design Basics" led by Douglas C. Montgomery (Arizona State University)

    #designofexperiments #statistics

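    A small sketch of "block what you can, randomize what you cannot": treatments are compared within each block of a known nuisance factor (here a hypothetical raw-material batch), while run order inside each block stays random:

```python
import random

random.seed(1)

treatments = ["A", "B", "C", "D"]                        # formulations to compare
blocks = ["batch_1", "batch_2", "batch_3"]               # known, controllable nuisance factor

run_sheet = []
for block in blocks:
    # Each treatment appears once per block, so the block effect cancels
    # out of within-block comparisons; the order inside the block is random,
    # so unknown nuisances (drift, ambient conditions) tend to average out.
    for treatment in random.sample(treatments, k=len(treatments)):
        run_sheet.append((block, treatment))

for run, (block, treatment) in enumerate(run_sheet, start=1):
    print(f"run {run:02d}  block={block}  treatment={treatment}")
```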
  • Lu Qian

    Biostatistician & Founder | Adaptive & Bayesian Trial Design | Regulatory-Grade Evidence for Go/No-Go Decisions

    “RAR reduces power” is one of the most repeated claims in adaptive trial design. It isn’t wrong, exactly. It’s a conclusion drawn from the wrong procedures.

    The simulations behind this claim (Thall, Fox and Wathen, 2015; Korn and Freidlin, 2011) all test essentially the same thing: Thompson Sampling or the Thall–Wathen procedure. Both are patient-benefit–oriented designs. They push allocation toward the arm that currently looks best, without explicitly targeting inferential efficiency. The power penalty those papers show is real. But they are not the only way to adapt allocation.

    DBCD and ERADE target a pre-specified optimal allocation ratio and converge toward it. In the Robertson et al. (2023) simulations under p₀ = 0.25, p₁ = 0.35:

    • Thompson Sampling: 14% probability of sending more patients to the inferior arm
    • DBCD / ERADE: near zero, comparable to permuted block randomization

    On power:

    • ER, DBCD, ERADE: ~80% power at N ≈ 700
    • Thompson Sampling: ~52%

    There’s also a second issue the “power penalty” literature largely ignores: the test statistic. The 2–4% power gains from DBCD reported by Tymofyeyev et al. (2007) were calculated using the score test. If you instead use the standard Wald z-test, which is what most software defaults to, Pin, Villar and Rosenberger (2025) show that type I error can inflate dramatically under optimal RAR allocation. The fix is simple: pre-specify the Pearson chi-squared test as the final analysis.

    New post working through both issues: 👉 https://lnkd.in/gW7aDRXs

    Thanks to William F. Rosenberger (George Mason University) for reviewing a draft of this post, and to Lukas Pin for additional context on the published Biometrics paper.

    #ClinicalTrials #AdaptiveDesign #Biostatistics #BayesianStatistics #ResponseAdaptiveRandomization
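    A small numerical illustration of the last point only, the choice of final test statistic, with made-up counts under an unequal allocation (this is not the DBCD/ERADE allocation procedure itself, and it does not reproduce the cited simulations):

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Hypothetical counts after a response-adaptive trial with unequal allocation:
# control: 240 patients, 60 responders; experimental: 460 patients, 161 responders.
n = np.array([240, 460])
x = np.array([60, 161])

# Pearson chi-squared test on the 2x2 table (the pre-specified analysis suggested above).
table = np.array([x, n - x])
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

# Wald z-test with unpooled variance (the usual software default).
p_hat = x / n
se = np.sqrt((p_hat * (1 - p_hat) / n).sum())
z = (p_hat[1] - p_hat[0]) / se
p_wald = 2 * norm.sf(abs(z))

print(f"Pearson chi-squared p = {p_chi2:.4f}   Wald z p = {p_wald:.4f}")
```

    The two tests are different computations and need not agree; per the post, the discrepancy becomes consequential once allocation is pushed toward the optimal RAR target, which is why the analysis should be pre-specified.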

  • Rutger Lit

    Senior Lead Decision Scientist @ ADC | Airline Pricing & Experimentation | Revenue Management | Founder, Time Series Lab

    ✈️ Week 3: Pods, randomization and why splitting a network is harder than it looks

    Randomization is the foundation of trustworthy experiments. If we randomly assign routes into groups, the estimated treatment effect is unbiased in expectation. Run the same experiment many times, and the average result would converge to the actual causal effect.

    But unbiased does not mean low variance. Two groups built by pure random draw can look different even without treatment. That is variance, not bias. Stratified sampling reduces this by randomizing within similar clusters, keeping the estimate unbiased but more stable.

    ────────────────

    ⚠️ Randomization gets complicated inside a connected network

    Splitting routes into pods is one way to structure experiments, but two issues appear immediately.

    • Route substitution interference. Routes influence each other. Changing prices on AMS-JFK can push passengers toward AMS-BOS or other hubs. If pods interact, the treatment effect gets blurred.

    • Shared-segment interference. For example, the itineraries FRA-LHR-JFK and MUC-LHR-JFK share the LHR-JFK segment. If that segment is treated in one pod but the connecting routes end up in another, the groups are no longer cleanly separated. Decisions on the shared segment influence both pods.

    Good pods aim to reduce interference. Similar ODs stay together, strongly connected ODs do not end up in different groups, and connecting itineraries ideally remain inside a single pod.

    ────────────────

    🔍 Why pod design becomes an optimization problem

    With only a handful of routes, pod creation can be intuitive. As the network grows, shared segments and inventory interactions expand much faster than the number of routes. The space of valid splits explodes, making it impossible to find clean groups by intuition alone. At scale, pod construction is no longer just randomization; it becomes clustering under network constraints, essentially an Operations Research question. Sometimes the conclusion is simple: a clean split is not possible without too much interference.

    ────────────────

    🚀 When pods fail, go temporal

    If interference is everywhere, forcing pods can do more harm than good. A full network switchback is often cleaner: treat everyone in alternating time windows. No cross-contamination, because there are no separate groups.

    ────────────────

    Takeaways

    • Randomization gives unbiased estimates, but variance matters.
    • Stratification reduces variance without introducing bias.
    • Pod design in airlines is about controlling interference.
    • Dense networks turn pod design into an optimization problem.
    • Sometimes the best split is no split at all: switch the entire network over time.

    #AirlinePricing #CausalInference #Experimentation #ADCConsulting
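    A toy sketch of the stratification idea applied to routes, with hypothetical pods and OD pairs; real pod construction, as noted above, is a constrained clustering problem rather than a loop like this:

```python
import random

random.seed(3)

# Hypothetical pods: groups of similar or strongly connected ODs.
pods = {
    "transatlantic": ["AMS-JFK", "FRA-JFK", "CDG-JFK", "AMS-BOS"],
    "via_LHR":       ["AMS-LHR", "FRA-LHR", "MUC-LHR", "CDG-LHR"],
    "regional":      ["AMS-NCE", "AMS-GVA"],
}

# Stratified randomization: split 50/50 *within* each pod, so every pod is
# represented in both arms. Still unbiased, but lower variance than one
# pure random draw over all routes at once.
assignment = {}
for pod, routes in pods.items():
    shuffled = random.sample(routes, k=len(routes))
    half = len(shuffled) // 2
    assignment.update({route: "treatment" for route in shuffled[:half]})
    assignment.update({route: "control" for route in shuffled[half:]})

for route, arm in sorted(assignment.items()):
    print(f"{route}: {arm}")
```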

  • Srishtik Dutta

    SWE-2 @Google | Ex - Microsoft, Wells Fargo | ACM ICPC ’20 Regionalist | 6🌟 at Codechef | Expert at Codeforces | Guardian (Top 1%) on LeetCode | Technical Content Writer ✍️| 125K+ on LinkedIn

    🎲 Demystifying Randomness: A Secret Weapon in Algorithm Design!

    Randomness in CS isn't chaos. It's structured, powerful, and incredibly useful for optimizing complex problems, especially in competitive programming, simulations, and interviews. Here are 7 powerful randomized techniques that every serious coder should know:

    🔢 1. Random Number Generation (RNG)
    PRNGs use a seed to generate a sequence that looks random but is deterministic.
    📌 Use Case: Games, simulations, randomized algorithms.

    🎯 2. Weighted Random Picking
    Assign weights to elements, build prefix sums, and binary search a random value over the total sum.
    📌 Use Case: Probabilistic selections like loot drops, biased random decisions.

    🔀 3. Fisher-Yates Shuffle
    Iterate backwards, swapping each element with one at a random index up to and including its own. Ensures uniform shuffling in O(n).
    📌 Use Case: Randomizing arrays, fair testing, card games.

    💧 4. Reservoir Sampling
    Select k random items from a large/streaming dataset without storing the whole stream.
    📌 Use Case: Online systems, log sampling, streaming data.

    ❌ 5. Rejection Sampling
    Sample from a simpler distribution and accept/reject based on the target-to-proposal ratio.
    📌 Use Case: Sampling from custom/complex probability distributions.

    📈 6. Monte Carlo Method
    Approximate values via repeated random simulations.
    📌 Use Case: Estimating π, finance models, physics, numerical integrals.

    🚫 7. Blacklist Random Sampling
    Need to exclude elements from a range during random picks? Map blacklisted items to non-blacklisted ones via hashing or swapping.
    📌 Use Case: Random unique ID generation, constrained sampling.

    💡 Why Learn These?
    These aren't just tricks. They unlock scalability, speed, and simplicity in otherwise hard problems. Whether you're optimizing systems, designing randomized tests, or modeling uncertainty, randomness is your ally. Master them, and you'll be amazed at the elegant solutions you can craft.
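    Rough sketches of three of these (Fisher-Yates, reservoir sampling, weighted picking); in practice the standard library already covers much of this (`random.shuffle`, `random.sample`, `random.choices`):

```python
import bisect
import random

random.seed(0)

def fisher_yates(items):
    """Uniform shuffle: walk backwards, swapping each slot with a random index <= it."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)              # inclusive of i itself
        a[i], a[j] = a[j], a[i]
    return a

def reservoir_sample(stream, k):
    """Keep k uniformly chosen items from a stream of unknown length (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

def weighted_pick(items, weights):
    """Prefix sums + binary search over a random point in [0, total)."""
    prefix, total = [], 0
    for w in weights:
        total += w
        prefix.append(total)
    r = random.random() * total
    return items[bisect.bisect_right(prefix, r)]

print(fisher_yates(range(10)))
print(reservoir_sample(range(1_000_000), k=5))
print(weighted_pick(["common", "rare", "epic"], [80, 15, 5]))
```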

  • Dr. Alexander Krannich

    Statistician | Clinical Research Expert

    How Many Stratification Factors Are ‘Too Many’ to Use in a Randomisation Plan?

    Stratification is often used in randomized controlled trials. It means that one or more factors are taken into account in the randomization so that their proportions in the groups are approximately equal. For example, stratifying on gender leads to roughly the same proportion of men and women being randomized to both groups. You can see this in the figure, where randomized patients are added to the lists one after the other. Study center, age, and previous illness are also commonly used as stratification factors.

    However, as the number of stratification factors increases, the number of randomization lists required multiplies with the number of levels of each factor. For gender with 3 levels (male, female, diverse), age with 2 levels (<65, ≥65), and 10 study centers, this results in 3 * 2 * 10 = 60 lists. If the number of patients is relatively small, some lists may be filled only sparsely or not at all, which can lead to unbalanced groups and strata.

    This problem is also addressed in the EMA guideline on adjustment for baseline covariates in clinical trials (EMA/CHMP/295050/2013): "With an increasing number of strata the chance of empty / infrequently occupied strata increases, thus the targeted treatment allocation within strata might not be achieved."

    The question is therefore: how many stratification factors are ‘too many’ to use in a randomisation plan? Terry M. Therneau answers this question in a publication with the corresponding title. He states that balance also depends on the method of randomization, and that no problems arise with suitable minimization methods as long as the number of factor combinations remains less than n/2.

    Do you have experience with stratified randomization? Any questions?

    #statistics #ClinicalTrials
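    A small sketch of how the list count multiplies, using the factors from the example above and an assumed permuted-block scheme per stratum (a real trial would of course use validated randomization software):

```python
import random
from itertools import product

random.seed(11)

# Stratification factors and their levels: 3 * 2 * 10 = 60 strata.
factors = {
    "gender": ["male", "female", "diverse"],
    "age":    ["<65", ">=65"],
    "center": [f"center_{i:02d}" for i in range(1, 11)],
}

def permuted_blocks(n_blocks=3, block_size=4):
    """A short randomization list made of balanced, randomly ordered blocks."""
    sequence = []
    for _ in range(n_blocks):
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        random.shuffle(block)
        sequence.extend(block)
    return sequence

# One independent list per stratum; in a small trial most of them stay nearly empty.
strata = list(product(*factors.values()))
randomization_lists = {stratum: permuted_blocks() for stratum in strata}

print(len(randomization_lists), "separate lists")
print(("female", "<65", "center_03"), "->", randomization_lists[("female", "<65", "center_03")])
```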

  • Carl-Hugo Marcotte

    Author of Architecting ASP.NET Core Applications: An Atypical Design Patterns Guide for .NET 8, C# 12, and Beyond | Software Craftsman | Principal Architect | .NET/C# | AI

    Random challenges? Tedious random item selection? Shuffling collection challenges?

    🎲 Simplify your randomization code with `Random.Shared`, `GetItems`, and `Shuffle` 🎲

    The `Random.Shared` property—introduced in .NET 6 (C# 10)—provides a single, shared instance of the `Random` class that can be used concurrently across multiple threads. This thread-safe instance eliminates the need to create separate `Random` instances for each thread, simplifying code, improving performance, and avoiding the classic pitfall of many `Random` instances created at nearly the same time producing identical sequences.

    In addition—starting with .NET 8 (C# 12)—the `Random` and `RandomNumberGenerator` classes introduced the `GetItems` and `Shuffle` methods. The `GetItems` method allows us to retrieve random items from a collection, while the `Shuffle` method randomizes the order of a collection in place.

    👍 Advantages
    - Simplicity: no need to write custom code for random item selection or shuffling.
    - Thread-safety: `Random.Shared` provides a thread-safe instance that can be used concurrently across multiple threads.
    - Performance: leverages .NET's implementation instead of rolling your own.
    - Consistency: ensures consistent behavior across the board.

    ✅ When to Use
    - Use the `GetItems` method to retrieve random items from a collection.
    - Use the `Shuffle` method to shuffle elements in an array or span.
    - Use the `Random.Shared` property for thread safety and performance instead of creating an instance of the `Random` class.
    - Use the `RandomNumberGenerator` class for your cryptographic needs.

    💡 Takeaways
    - The `Random.Shared` property offers a convenient, thread-safe instance of `Random`, simplifying concurrent random number generation.
    - The `GetItems` method streamlines retrieving random items from a collection, saving time and improving code clarity.
    - The `Shuffle` method simplifies randomizing a collection without needing custom code.

    📌 Tips
    The `Random` class is not suitable for cryptographic purposes. For secure random number generation, use `RandomNumberGenerator` from the `System.Security.Cryptography` namespace.

    💬 Comments
    Have you tried `Random.Shared` or the methods discussed in this post? Share your experiences and thoughts!

    #ASPNETCore #dotnet #csharp #CodeDesign #ProgrammingTips #CleanCode
