✅ The best part about working at Google? Being surrounded by great minds and having the chance to learn from them every day. 🧑‍💻

Recently, while improving my AI knowledge, I reached out to my colleague Rohit Yadav, a data scientist and SME in causal inference. He helped me understand some tricky concepts and also shared an excellent resource on the topic.

📍 Here's a quick summary of what I learned:

1️⃣ Causation vs. correlation: Just because two things happen together doesn't mean one causes the other.
2️⃣ A/B testing: Useful for simple experiments, but it can miss hidden factors that influence results.
3️⃣ Double Machine Learning (DoubleML): A modern technique that helps identify what actually causes changes, even when the data is complex.
4️⃣ Practical examples: The resource explains how to measure the effect of interventions in real-world scenarios, such as testing changes to a recommendation system while accounting for all the other factors that might affect user behavior.

👉 Think of it like this: you don't just want to know which recommendation got more clicks; you want to know which change actually caused the increase while controlling for everything else.

What stood out to me is how the concepts are broken down into actionable steps, showing exactly how a data scientist can go from a simple A/B test to using DoubleML in practice. 🙌

It also highlights common pitfalls, like ignoring confounding variables or misinterpreting results, and provides guidance on how to avoid them - which is incredibly useful for anyone designing experiments or analyzing data.

Finally, it uses examples and intuition rather than only theory, so you can see how to apply causal inference methods to real problems without getting lost in heavy math. 💯

🔗 Check it out here: https://lnkd.in/gUgp6Uid

Highly recommended if you want to level up your causal reasoning and data science skills. ✌️

#AI #DataScience #CausalInference #MachineLearning #Google #LearningFromExperts
Designing Experiments for Machine Learning
Explore top LinkedIn content from expert professionals.
Summary
Designing experiments for machine learning means carefully planning how to test new models or features to reliably understand their impact, whether through A/B tests, causal inference methods, or structured simulations. This process helps ensure that results are trustworthy and meaningful by controlling for confounding factors and choosing the right metrics.
- Choose your metrics: Select both primary metrics tied to your goals and counter metrics that safeguard against unwanted side effects before running your experiment.
- Validate your groups: Make sure control and treatment groups are similar by comparing key outcomes before you start, so your results aren’t skewed by hidden differences.
- Design smarter datasets: Focus on targeted sampling and thoughtful experimental design to cover relevant scenarios, which improves reliability and reduces wasted effort.
-
🔍 Validating Causal ML Models: Why and How to Create Realistic Benchmarks

Unlike predictive machine learning (ML), where one can evaluate models on a held-out test set with known true labels, causal inference faces a validation challenge: the true causal effect in real data is unknown. This is not a weakness; it reflects the nature of causal modeling. But without rigorous validation, we risk being misled. So how do we test causal ML models when ground truth is absent in real data?

🧪 A Powerful Option: Semi-Synthetic Simulations
This method blends real-world data with known causal structures to create realistic, yet truth-grounded, test environments.

🔄 What It Involves:
* Preserve real covariates: Keep actual feature sets (X) from empirical data (e.g., medical or biological data sets), maintaining all their correlations, noise, and complexity.
* Impose known causal mechanisms: Generate a synthetic outcome (Y*) using a defined function of treatment (T) and selected covariates (X).
* Evaluate model performance: Test whether the causal model can uncover the imposed treatment effect within the authentic "messiness" of real feature spaces.

⚙️ Why This Approach Is Useful:
* Often more realistic than fully synthetic data: Semi-synthetic benchmarks retain the multivariate complexity of real datasets (e.g., multicollinearity, non-linearities).
* Balances realism with ground truth: Provides a known target for evaluation while forcing models to perform under conditions that mirror real data challenges.

🎯 As causal ML grows more sophisticated, validation frameworks should evolve too, toward simulations that respect real-world data structure. Of course, it is important to avoid testing on only a handful of "cherry-picked" scenarios that a priori favor specific models or methods. It is much more informative to design simulations that probe model robustness in different ways, including response shapes or effect magnitudes with some "adversarial" intent that may challenge each of the compared estimators.

I would love to hear from others in #Biostatistics, #Econometrics, #MachineLearning, and #CausalInference: How do you currently validate your causal models? Have you used semi-synthetic simulations?

#ComputationalBiology #DataScience #AcademicResearch #ResearchMethodology #ColumbiaUniversity #AI #Innovation #Science
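A minimal sketch of the semi-synthetic idea, using only numpy, pandas, and scikit-learn; the covariates here are random placeholders standing in for a real feature matrix, and the imposed effect and outcome function are arbitrary illustrative choices:

```python
# Semi-synthetic benchmark sketch: real covariates X, simulated T and Y* with a
# known treatment effect, so a causal estimator can be scored against ground truth.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in for real covariates; in practice, load an empirical feature matrix here.
X_real = pd.DataFrame(rng.normal(size=(2000, 5)), columns=[f"x{i}" for i in range(5)])

TRUE_EFFECT = 1.5  # the ground truth we impose

# Treatment depends on covariates (confounding); the outcome depends on T and X.
propensity = 1 / (1 + np.exp(-(0.8 * X_real["x0"] - 0.5 * X_real["x1"])))
T = rng.binomial(1, propensity)
Y_star = (TRUE_EFFECT * T
          + 2.0 * X_real["x0"] - 1.0 * X_real["x1"] + 0.5 * X_real["x2"] ** 2
          + rng.normal(scale=1.0, size=len(X_real)))

# A naive difference in means is biased by the confounding...
naive = Y_star[T == 1].mean() - Y_star[T == 0].mean()

# ...while a covariate-adjusted regression (one of many candidate estimators)
# should land much closer to TRUE_EFFECT on this benchmark.
adj = LinearRegression().fit(np.column_stack([T, X_real]), Y_star)
print(f"naive: {naive:.2f}, adjusted: {adj.coef_[0]:.2f}, truth: {TRUE_EFFECT}")
```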
-
Multi-objective formulation optimization, too few samples, dismal model performance. Where to go?

If you've worked on industrial formulations, you've seen this before: a handful of experiments, properties that fight each other, and models that look great on training data… only to be seriously overfit on noise.

A new paper in Chemical Science offers a surprisingly practical account, using multi-objective optimization of self-healing polyurethanes as a concrete case. Rather than hiding the failures, the authors walk through them and turn them into a playbook that you can adapt:

Step 1. Start with a random baseline. A small, randomly sampled dataset is used to train standard models and then naively expanded with more random experiments. Overfitting dominates, making it clear that random sampling doesn't solve the problem.

Step 2. Diagnose failure instead of tuning harder. Feature-importance analysis shows that chemically important variables contribute little to predictions, confirming that the models are learning spurious correlations rather than structure–property relationships.

Step 3. Redefine the inputs using chemistry-informed descriptors. Raw formulation ratios are replaced by a small set of descriptors encoding stoichiometric balance, chain-extender balance, and hard/soft segment ratio. This reduces the experimental design space while encoding known chemical mechanisms.

Step 4. Design the dataset instead of sampling blindly. A gradient-designed dataset is constructed in descriptor space. With just 9 designed samples, model generalization improves substantially, showing that data quality and coverage matter more than sample count.

Step 5. Use Pareto optimization and expand the design space. Multi-objective optimization makes trade-offs visible. When progress stalls, key descriptor ranges are widened to explore new regions.

Step 6. Consolidate datasets and validate predictions. Complementary designed datasets are merged to predict candidates beyond the current Pareto front. But initial experimental validation fails dramatically, signaling extrapolation beyond the covered chemical space.

Step 7. Fill gaps, re-optimize, and validate successfully. Failures are traced to missing regions of descriptor space. Targeted experiments fill these gaps, after which re-optimization yields predictions that closely match experiments. In total, ~20 samples prove sufficient for this system.

Step 8. Confirm physical consistency, convergence, and generalization. Structure–property analysis aligns with established polymer physics, further data no longer improves the Pareto front, and the same workflow generalizes to a different polyurethane system.

If you're stuck on complex formulation modeling challenges, this paper is worth a careful read.

📄 Chemically-informed active learning enables data-efficient multi-objective optimization of self-healing polyurethanes, Chemical Science, December 23, 2025
🔗 https://lnkd.in/eTAg7QkW
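As a toy illustration of the Pareto step (Step 5), here is a sketch that keeps only the non-dominated candidates among model-predicted properties; the candidate values and the two objectives are placeholders, not taken from the paper:

```python
# Pareto-front sketch: among candidate formulations with two predicted properties
# (both to be maximized), keep the non-dominated ones. The candidate values are
# random placeholders standing in for model predictions over descriptor space.
import numpy as np

rng = np.random.default_rng(1)
predicted = rng.uniform(size=(200, 2))  # columns: e.g. two competing material properties

def pareto_front(points: np.ndarray) -> np.ndarray:
    """Return a boolean mask of points not dominated by any other point."""
    n = len(points)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Point i is dominated if some other point is >= on every objective
        # and strictly > on at least one objective.
        others = np.delete(points, i, axis=0)
        dominated = np.any(np.all(others >= points[i], axis=1) &
                           np.any(others > points[i], axis=1))
        mask[i] = not dominated
    return mask

front = predicted[pareto_front(predicted)]
print(f"{len(front)} Pareto-optimal candidates out of {len(predicted)}")
```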
-
Doing A/B tests is an integral part of an ML engineer's life. You have your model changes and you want to test them in a production environment, but what do you need to keep in mind while testing? Here are some of the things you should always do when using any A/B system in your company:

1. Primary metrics: your North Star
- Define these BEFORE launching
- Must directly tie to business objectives
- Keep it focused: 2-3 max to avoid diluted insights
- Example: CTR, conversion rate, revenue per user

2. Counter metrics: guard rails
- Protect against unintended consequences
- Monitor user experience metrics
- Watch for negative impacts on related features
- Example: If optimizing for clicks, monitor session length to ensure quality, or check whether other features are being affected by your test.

3. Pre-bias analysis: your sanity check
- Compare control vs. treatment groups BEFORE the experiment
- Verify key metrics are similar between groups for the past few weeks
- No statistical differences should exist

4. Statistical power: size matters
- Calculate the minimum sample size needed
- Account for the effect size you want to detect
- Consider baseline conversion rates
- Example: Detecting a 2% lift might need weeks of data

Pro tip: Document your decision framework before launching your test. It prevents moving goalposts and builds credibility.

#MachineLearning #ABTesting #DataScience #ExperimentalDesign
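For the statistical-power point, a minimal sample-size sketch with statsmodels; the 10% baseline conversion rate and 2% relative lift are illustrative numbers, not figures from the post:

```python
# Minimum sample size per group to detect a 2% relative lift on a 10% baseline
# conversion rate (illustrative numbers) at 80% power and 5% significance.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10
treated = baseline * 1.02  # 2% relative lift -> 10.2% conversion

effect_size = proportion_effectsize(treated, baseline)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=0.05, power=0.8,
                                           alternative="two-sided")
print(f"~{int(round(n_per_group)):,} users per group")
# With realistic daily traffic, this is why a 2% lift can take weeks of data.
```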
-
I remember the first time I had to design an online experiment: it was far less trivial than it looked on paper. Even deciding how to measure outcomes was not trivial under staggered assignment. Should one use a measurement window of fixed length for each unit (window metric), or a single window for all units, with varying lengths depending on the assignment timing (cumulative metric)? An example of the first is "revenue in the first week after treatment", while an example of the latter is "revenue during the experiment".

Scientists at Spotify have written a very readable paper on the topic, making explicit the trade-offs between these two approaches. Window metrics make the results easier to interpret and compare across experiments, while cumulative metrics depend on the experiment duration and on the assignment timing. However, it might take longer to get significant results with window metrics. The authors also highlight a well-known "paradox": with cumulative metrics, power can decrease over time.

Personally, I think this choice depends a lot on how the estimates will be used. If the goal is backward-looking (e.g. program evaluation), cumulative metrics seem better suited, since we get estimates of the total impact for free. If instead the goal is forward-looking, window metrics provide more general and interpretable insights.

The reassuring part is that, except for power calculations, you don't have to make these decisions in advance and you can always change your estimand retrospectively.

https://lnkd.in/eP4xDDiS
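A toy pandas sketch of the two estimands, with made-up column names and data (not from the Spotify paper):

```python
# Window vs. cumulative metrics under staggered assignment: each unit has its own
# treatment_start date, and revenue events arrive over time.
import pandas as pd

events = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "event_date": pd.to_datetime(["2024-03-02", "2024-03-20",
                                  "2024-03-12", "2024-03-13", "2024-03-28"]),
    "revenue":    [5.0, 7.0, 3.0, 4.0, 9.0],
})
assignments = pd.DataFrame({
    "user_id": [1, 2, 3],
    "treatment_start": pd.to_datetime(["2024-03-01", "2024-03-10", "2024-03-25"]),
})
experiment_end = pd.Timestamp("2024-03-31")

df = events.merge(assignments, on="user_id")
since_start = df["event_date"] - df["treatment_start"]

# Window metric: the same 7-day window per unit, anchored at each unit's own start date.
window = (df[(since_start >= pd.Timedelta(0)) & (since_start < pd.Timedelta(days=7))]
          .groupby("user_id")["revenue"].sum().rename("revenue_first_week"))

# Cumulative metric: everything from each unit's start until the experiment ends,
# so later-assigned units mechanically contribute shorter observation periods.
cumulative = (df[(since_start >= pd.Timedelta(0)) & (df["event_date"] <= experiment_end)]
              .groupby("user_id")["revenue"].sum().rename("revenue_during_experiment"))

print(pd.concat([window, cumulative], axis=1).fillna(0.0))
```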
-
This year, my focus is on setting out how we should 𝐦𝐞𝐚𝐬𝐮𝐫𝐞, 𝐝𝐞𝐬𝐢𝐠𝐧, 𝐚𝐧𝐝 𝐭𝐫𝐮𝐬𝐭 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 for the 𝐧𝐞𝐱𝐭 𝐠𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐨𝐧 𝐨𝐟 𝐀𝐈 𝐦𝐨𝐝𝐞𝐥𝐬. Many conclusions today are drawn from large-scale results that are noisy, benchmark-driven, and sometimes contaminated. My focus here is to turn architectural capabilities into 𝐦𝐞𝐚𝐬𝐮𝐫𝐚𝐛𝐥𝐞 𝐬𝐢𝐠𝐧𝐚𝐥𝐬, not anecdotes.

To do this, I build a 𝑠𝑦𝑛𝑡ℎ𝑒𝑡𝑖𝑐 𝑝𝑟𝑒𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 𝑝𝑙𝑎𝑦𝑔𝑟𝑜𝑢𝑛𝑑 — 𝑛𝑜𝑡 𝐿𝐿𝑀-𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 𝑑𝑎𝑡𝑎, but 𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑, 𝑡𝑎𝑠𝑘-𝑑𝑒𝑠𝑖𝑔𝑛𝑒𝑑 setups that isolate skills and remove confounders.

🎓 This is 𝐓𝐮𝐭𝐨𝐫𝐢𝐚𝐥 𝐈𝐈 𝐨𝐟 𝐏𝐡𝐲𝐬𝐢𝐜𝐬 𝐨𝐟 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥𝐬, a brand-new tutorial following my ICML 2024 tutorial in Austria, now with a stronger focus on architecture design.

Counter-intuitive takeaway: 𝐰𝐞𝐥𝐥-𝐜𝐨𝐧𝐭𝐫𝐨𝐥𝐥𝐞𝐝 ~100𝐌-𝐬𝐜𝐚𝐥𝐞 𝐞𝐱𝐩𝐞𝐫𝐢𝐦𝐞𝐧𝐭𝐬 can reveal architectural mechanisms and failure modes that frontier-scale training often masks.

This methodology is the backbone of the entire series:
▶️ First video (𝐭𝐡𝐢𝐬 𝐨𝐧𝐞, 𝟔𝟎 𝐦𝐢𝐧) — methodology & playground design (Part 4.1a)
🔜 Second — architectural principles from the playground (Part 4.1b)
🔜 Third — when the playground reshapes real-life pretraining (Part 4.2)

https://lnkd.in/gFw3iv4f
Physics of LM: Part 4.1a, How to Build a Versatile Synthetic Pretrain Playground
-
Very excited to share a new paper that has been a long time in the making. This has been a fun collaboration with my co-authors Ruoxuan Xiong (Emory) and Alex Chin (my co-worker at Lyft and now Motif Analytics).

Randomized experiments are the gold standard for measuring causal effects, but in marketplaces we are often testing policies with many plausible spillovers that make it difficult to learn what we need by assigning treatment across users. Instead we randomize over time. This type of experiment seems simple to design: you are implementing a square wave (a type of oscillator) that determines which policy you are running based on time. When I was at Lyft, we had some heuristics for choosing switchback parameters, but we rarely had bandwidth to understand their impact.

It turns out to be a rich design space, and by choosing how and when you switch policies, you control the bias and variance of the estimates from your experiment. Intuitively, faster switching yields lower variance by increasing your sample size, but it increases bias because effects tend to persist over time (carryover effects). Your measurements from each time period are also correlated and have heteroskedastic errors due to seasonality (marketplaces tend to have strong daily and weekly cycles).

Our approach is effectively a model-based design process where we use historical data to estimate the inputs to the experimental design. The data allow us to make informed decisions about switching behavior that will yield the lowest error in our estimates. Carryover effects are the hardest quantity to estimate from historical data because on any individual test they are quite noisy, so pooling is necessary to gain additional precision. We analyze a corpus of hundreds of switchback tests from Lyft's marketplace and cluster them into an interpretable distribution over impulse responses.

A broader point of this research is that all experimental designs lean on prior knowledge to improve the chances of a successful experiment -- even choosing a sample size for desired power in a standard A/B test. In switchback tests, there is an important bias-variance tradeoff we must manage. Without some means to estimate the covariance of errors and the likely size and shape of carryover effects, it is difficult to design an experiment that is likely to be successful.
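A toy simulation of the carryover-bias half of that tradeoff: the square-wave design matches the post, but the carryover model and all numbers are my own illustrative assumptions, not the paper's.

```python
# Switchback sketch: a square-wave assignment over time, with a treatment effect
# that takes a few time steps to kick in or wear off (carryover). Short blocks are
# dominated by carryover; longer blocks recover the true lift.
import numpy as np

rng = np.random.default_rng(7)
TRUE_EFFECT = 2.0        # steady-state lift of policy B over policy A
CARRYOVER_STEPS = 3      # time steps for the effect to fully settle after a switch
N_STEPS = 12_000         # total length of the experiment

def switchback_estimate(block_length: int) -> float:
    """Difference-in-means estimate from a square-wave design with carryover."""
    y_b, y_a = [], []
    for start in range(0, N_STEPS, block_length):
        on_b = (start // block_length) % 2 == 0      # alternate B/A blocks
        for t in range(block_length):
            settled = min(t / CARRYOVER_STEPS, 1.0)  # how much the switch has taken hold
            effect = TRUE_EFFECT * (settled if on_b else 1.0 - settled)
            y = 10.0 + effect + rng.normal(scale=3.0)
            (y_b if on_b else y_a).append(y)
    return float(np.mean(y_b) - np.mean(y_a))

for block in (2, 10, 60):
    print(f"block={block:>3}: estimated lift ~ {switchback_estimate(block):.2f} "
          f"(truth {TRUE_EFFECT})")
```

The variance side of the tradeoff (correlated, heteroskedastic errors across periods) is not modeled here; the sketch only shows why faster switching inflates carryover bias.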
-
Understand your data and think about your estimation approach - especially if you have a non-standard experiment like an instrumental (or encouragement) design.

These pop up all the time when we want to know the impact of something we cannot control (whether a user clicks on an ad or uses a promotional offer), but we can control how appealing that action is (the ad or promotional offer's ranking). We can randomize how appealing the action is and use that to create experimental variation in the action we are interested in.

✅ The right approach is the standard two-stage setup, known as an instrumental variable (IV) model. In this setup, we strictly only allow the randomized appeal to affect the outcome indirectly, through the action.

Using simulation data (notebook here: https://lnkd.in/gNAn7vD4), I show that if we instead naively throw all the features into a single model, we get a biased and under-valued estimate. 😢 This makes sense, because it allows too much flexibility: the randomized appeal can directly impact the outcome, which takes away from the importance of the action we are interested in.

#causalinference #econometrics #experimentdesign
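A generic two-stage least squares sketch, not a reproduction of the linked notebook: the data-generating process and every coefficient are assumptions, and the naive comparison here is a plain regression of the outcome on the action, which an unobserved confounder biases.

```python
# IV / encouragement-design sketch: appeal (the instrument) is randomized, it shifts
# whether the user takes the action, and the action drives the outcome. An unobserved
# confounder biases naive regression, while 2SLS recovers the true effect.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 50_000
TRUE_ACTION_EFFECT = 2.0

appeal = rng.normal(size=n)                  # randomized instrument (e.g. offer ranking)
confounder = rng.normal(size=n)              # unobserved user intent
action = 0.7 * appeal + 1.0 * confounder + rng.normal(size=n)    # e.g. offer redemption
outcome = TRUE_ACTION_EFFECT * action + 1.5 * confounder + rng.normal(size=n)

# Naive regression of outcome on action: biased by the unobserved confounder.
naive = LinearRegression().fit(action.reshape(-1, 1), outcome).coef_[0]

# 2SLS: stage 1 predicts the action from the instrument only,
# stage 2 regresses the outcome on the predicted action.
stage1 = LinearRegression().fit(appeal.reshape(-1, 1), action)
action_hat = stage1.predict(appeal.reshape(-1, 1))
iv = LinearRegression().fit(action_hat.reshape(-1, 1), outcome).coef_[0]

print(f"naive OLS: {naive:.2f}, 2SLS: {iv:.2f}, truth: {TRUE_ACTION_EFFECT}")
```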
-
Working on a study guide for contextual multi-armed bandit testing, to make it easier and more accessible. The crux is the setup and data mapping: arms, features, etc.

My thesis is that as the cost of variations goes to zero, A/B testing will shift more and more toward model testing. I'm pushing myself and my team to understand these approaches. I also think we shouldn't just let tech do this automagically; we should really understand the setup.

Google doc for feedback and collaboration - Test Plan: Contextual Multi-Armed Bandit (CMAB) Experimentation for Product Recommendations: https://lnkd.in/gACFcseh

Problem statement: Traditional approaches to product recommendations (static A/B testing or fixed rules) fail to adapt as user preferences shift. We need a method that:
> Learns which offers perform best in different contexts
> Continuously optimizes instead of waiting for a test to "end"
> Handles concept drift (e.g., seasonality, pricing changes, new behaviors)

Hypothesis: By applying CMAB to product recommendations, we predict:
> Higher conversion rates vs. static A/B tests
> Faster adaptation to new user behaviors
> More efficient decision-making at scale

Experiment design:
> Exploration phase: Initial even distribution of recommendations
> Adaptive learning: CMAB dynamically favors high-performing offers
> Control groups: Compare against A/B and rule-based allocations

Metrics for success:
> Primary KPI: Conversion rate on recommendations
> Secondary KPIs: Engagement (time on site, feature usage), revenue impact
> Model adaptability: How quickly it adjusts after a major context shift
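A minimal LinUCB-style sketch of the adaptive-learning phase, one linear model per offer; the context features, the simulated conversion model, and the exploration parameter are all made-up assumptions, not part of the test plan in the doc:

```python
# Minimal LinUCB-style contextual bandit (disjoint model, one ridge model per arm/offer).
import numpy as np

rng = np.random.default_rng(42)
N_ARMS, DIM, ALPHA = 3, 4, 1.0

# Hidden "true" conversion model per offer, used only to simulate user feedback.
true_theta = rng.normal(size=(N_ARMS, DIM))

# Per-arm state: A accumulates x x^T (plus identity), b accumulates reward * x.
A = np.stack([np.eye(DIM) for _ in range(N_ARMS)])
b = np.zeros((N_ARMS, DIM))

clicks = 0
for t in range(5_000):
    x = rng.normal(size=DIM)                    # user/session context features
    ucb = np.empty(N_ARMS)
    for a in range(N_ARMS):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]
        ucb[a] = theta @ x + ALPHA * np.sqrt(x @ A_inv @ x)  # mean + exploration bonus
    arm = int(np.argmax(ucb))

    # Simulated binary conversion: higher true score -> higher click probability.
    p = 1 / (1 + np.exp(-(true_theta[arm] @ x)))
    reward = rng.binomial(1, p)
    clicks += reward

    A[arm] += np.outer(x, x)
    b[arm] += reward * x

print(f"overall conversion rate while learning: {clicks / 5_000:.3f}")
```

In an actual test, the control groups from the plan above (static A/B and rule-based allocation) would run alongside this policy on held-out traffic so the primary KPI can be compared like for like.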