🛑 Spoiler alert: plot A and plot B are actually from the same data set. 🛑

You just finished an A/B test using eye tracking and ran a t-test. You look at the p-value and bar plot (Plot A)… nothing. No significant difference. You sigh, assume the design change did not matter, and start writing your report.

Then, just to double-check, you plot the histogram (Plot B) for both groups. And your jaw drops. One group is heavily skewed to the right. The other is skewed to the left. Their medians are far apart. The only thing that happened to be the same was the mean.

In that moment you realize something very painful: you picked the wrong analysis and the wrong plot. The t-test completely missed the effect because it only compares means, and your means were nearly identical. But the actual distributions were dramatically different. The effect was very real and actually very large.

This happens more often than people admit. It has happened in many UX studies, and the results have been misleading: wrong conclusions, wrong design decisions, and sometimes wasted weeks of research. Not because the data were bad, but because the wrong statistical test was used.

This is why you must always check your distributions before you choose your test (please do it!). The shape of your data matters. When your data are clean and roughly normal, parametric tests like the t-test work beautifully. But UX data rarely behave that nicely. Eye-tracking metrics, time on task, fixation counts, click delays, stress ratings, Likert responses… they are often skewed, heavy-tailed, or full of outliers. When that happens, parametric tests lose accuracy and can completely miss real effects. Non-parametric tests do not have this problem because they compare the overall distribution and the median, not just the mean.

Parametric tests, like the independent-samples t-test, assume normal data and equal variances. They only check whether the means differ. Non-parametric tests, like the Mann-Whitney U test, do not need normality. They pick up differences in medians and in overall distribution shape. They are often the correct choice in real UX research.

Here is what happened in our example. We created two datasets with the exact same mean. One was right-skewed. The other was left-skewed. The medians were very different.

🔴 Results from the parametric t-test
t statistic: 0.00000000000000531
p value: 0.9999999999999958
Conclusion: no difference

🟢 Results from the non-parametric Mann-Whitney U test
U statistic: 102112
p value: 0.000000538
Conclusion: very strong difference

The t-test completely failed. The non-parametric test detected a very real and important effect.

This is the entire point. If you only look at means, you are flying blind. If you check distributions first and choose the correct test, you avoid costly mistakes that have already ruined many UX studies.
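To make this concrete, here is a minimal sketch in Python (my own illustration with NumPy and SciPy, not the original analysis behind the numbers above) that builds two groups with matched means but opposite skew and runs both tests:

```python
# Minimal sketch: two groups with (nearly) identical means but opposite skew,
# compared with a t-test and a Mann-Whitney U test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Right-skewed group, e.g. a time-on-task-style metric.
group_a = rng.gamma(shape=2.0, scale=1.5, size=1000)

# Left-skewed group: mirror a gamma sample, then shift it so its mean matches group A's.
raw_b = -rng.gamma(shape=2.0, scale=1.5, size=1000)
group_b = raw_b + (group_a.mean() - raw_b.mean())

print("means:  ", group_a.mean(), group_b.mean())           # essentially identical
print("medians:", np.median(group_a), np.median(group_b))   # clearly different

t_stat, t_p = stats.ttest_ind(group_a, group_b)                               # compares means only
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")   # compares distributions
print(f"t-test:       t = {t_stat:.2g}, p = {t_p:.3g}")
print(f"Mann-Whitney: U = {u_stat:.0f}, p = {u_p:.3g}")
```

With samples like these, the t statistic sits near zero and its p-value near 1, while the Mann-Whitney p-value collapses, mirroring the pattern described above.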
How to Interpret A/B Testing Results
Explore top LinkedIn content from expert professionals.
Summary
Interpreting A/B testing results means understanding what the data is actually telling you about the differences between two versions—whether those differences are real, meaningful, and likely to last. A/B testing is a way to compare two options, but making sense of the results requires more than just looking at averages or statistical labels.
- Examine the data distribution: Always look at how your data is spread out before choosing a statistical test, since relying only on averages or means can hide important differences between groups.
- Check for subgroup effects: Break down your results by different types of users or customers to make sure a win for the overall group isn’t hiding a loss for an important segment.
- Consider confidence intervals: Pay attention to the range of possible outcomes your test results suggest, since a wide range means there’s more uncertainty about what will actually happen if you roll out the change.
A company runs an A/B test. Version B wins: 12% lift, statistically significant. Champagne. 🎉 Six months later, revenue is flat. What happened?

They averaged over their customers. Rookie move. (I've done it too.)

Version B: +20% for new users. But -8% for returning customers. New users outnumbered returners in the test, so B "won." Then the customer mix shifted. More returners. The "winning" variant was slowly bleeding its best users.

This is Simpson's Paradox: aggregate trends reverse at the subgroup level. It's not exotic. It's everywhere. Data-driven teams walk into this constantly, because the first rule of being data-driven is "run the test and trust the average."

The fix isn't more data. It's asking: for whom did each version win? Averages describe populations. They don't describe people.

The most dangerous phrase in analytics isn't "we don't have data." It's "the data is clear."

For those wrestling with weird A/B results: I see you! Ask who's in your sample before you pop the champagne.
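A back-of-the-envelope sketch (with assumed numbers, not the company's actual data, and assuming both segments share the same baseline conversion rate) shows how unchanged segment-level effects can flip the blended result once the traffic mix shifts:

```python
# Hypothetical illustration of Simpson's Paradox in an A/B rollout.
# Assumption: both segments share the same baseline conversion rate, so the
# blended relative lift is the mix-weighted average of the segment lifts.
def blended_lift(share_new, lift_new=0.20, lift_returning=-0.08):
    """Overall relative lift of variant B, given the share of new users in traffic."""
    return share_new * lift_new + (1 - share_new) * lift_returning

print(f"test period, 70% new users: {blended_lift(0.70):+.1%}")   # looks like a clear win
print(f"six months later, 30% new:  {blended_lift(0.30):+.1%}")   # roughly flat
```

The segment effects never change; only the weighting does, which is exactly why slicing results by subgroup matters before shipping.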
-
How can an A/B test be "statistically significant" but not be totally trustworthy? I've been wrestling with this question for over a decade. Through extensive research, I've come to understand that a big part of the answer lies in confidence intervals.

Here's the simplest way to explain it. Imagine you run an A/B test with very small numbers:
🚥 Version A: 82 visitors, 3 conversions
🚦 Version B: 75 visitors, 12 conversions

The math shows the result is statistically significant.
⚡ The p-value is 0.0088, well below the common p < 0.05 threshold
⚡ Observed power is reported as 95.09%, well above the standard 80% rate

The results look convincing. Statistical significance, long treated as the gold standard, has been achieved!

But here's the problem. Statistical significance can be "gamed" with low-traffic tests because it only answers one narrow question:
🔦 If there were truly no difference between versions, how likely is it this result would happen by chance?

That's it. That's all statistical significance answers. It doesn't tell you whether the result is stable or repeatable. And, as you can imagine, with tiny samples, like 3 vs. 12 conversions, you get exaggerated effects. Every single conversion has an outsized influence. One or two people behaving differently can completely flip the outcome.

➡️ This is where confidence intervals come in. A confidence interval is the range of outcomes that could reasonably be true given the data. In small tests, that range is really wide. So the actual conversion effect might be smaller or larger than what you measured in the test. You can't know with precision. So you don't have a dependable estimate of how big the improvement really is, or whether the result would hold if you ran the test again.

It's important to realize that a confidence interval is not the same as a confidence level. Remember: a confidence interval is the range of values that could reasonably be true given the data. A label of "95% confidence" describes how that range was constructed, not how certain or correct the result is. Which means a 95% confidence interval can still be very wide, creating substantial uncertainty around the estimate.

When there's that much uncertainty, the numbers may appear exaggerated. That's where Twyman's Law comes in: anything that looks interesting or unusual is usually wrong. In small samples, results are extreme because the noise does most of the work. So while a statistically significant difference can be measured in a small-sample study, you can't reliably measure how big that difference actually is. That's why 3 vs. 12 conversions often fail to replicate once more data is collected.

📣 Call to action for 2026: run tests that are not only statistically significant but also have a large enough sample size to produce narrow confidence intervals, so you can not only detect effects but also estimate them precisely enough to make accurate, trustworthy test decisions.
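To see how wide the interval really is for those numbers, here is a rough sketch using a simple Wald-style interval for the difference in conversion rates (my own calculation; the post's testing tool may use a different method):

```python
# Wald-style 95% confidence interval for the difference in conversion rates
# from the example above (3/82 vs. 12/75).
import numpy as np
from scipy import stats

conv_a, n_a = 3, 82    # Version A
conv_b, n_b = 12, 75   # Version B

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Standard error of the difference between two independent proportions.
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = stats.norm.ppf(0.975)

print(f"observed difference: {diff:.1%}")
print(f"95% CI: [{diff - z * se:.1%}, {diff + z * se:.1%}]")  # a very wide range
```

The point estimate is roughly a 12-point lift, but the interval stretches from about a 3-point lift to over 20 points, exactly the kind of imprecision the post is describing.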
-
🚨 Your A/B test results are not the real impact.

A happy PM runs an A/B test → sees a +15% lift in revenue → scales the feature to all users → shares the big win in Slack 🎉 But… once the feature is fully rolled out, the KPI impact isn't there. Why? Because test results often don't reflect the true long-term effect. Here are a few reasons why this happens:

1️⃣ Confidence intervals matter → That "+15%" is actually a range. The lower bound might be close to zero.
2️⃣ Novelty effect → Users are excited at first, but the effect fades as they get used to the change.
3️⃣ Experiments aren't additive → Three +15% lifts don't stack to +45%. There's a ceiling, and improvements often cannibalize each other.
4️⃣ Sample ≠ population → The test group might not represent your entire user base. For example, you might have more high-intent users in the variant.
5️⃣ Time-to-KPI effects → We see this a lot, especially in conversion experiments. The experiment may improve time to conversion, so it looks like you're winning when you close the experiment; but if you monitor the users a few days or weeks after it ends, there is no difference in total conversions between the variant and the control.
6️⃣ Type I error → With a p-value threshold of 0.05 (or worse, 0.1), there's still a decent chance the "win" is a false positive.

👉 That's why tracking post-launch impact is just as important as running the experiment itself. Methods like holdout groups, simple correlation tracking, or causal inference models (building a synthetic control) help reveal the real, sustained effect.
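As one way to make the post-launch check concrete, here is a minimal sketch of a holdout-group comparison (hypothetical data and metric; in practice the samples would come from your own tracking):

```python
# Hypothetical post-launch check: keep a small random holdout on the old
# experience and compare the KPI against rolled-out users over a longer window.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in revenue-per-user samples; real values would come from your warehouse.
rollout = rng.exponential(scale=10.2, size=20_000)   # users on the new feature
holdout = rng.exponential(scale=10.0, size=1_000)    # 5% holdout on the old version

lift = rollout.mean() / holdout.mean() - 1

# Large-sample normal approximation for the uncertainty of the difference in means.
se = np.sqrt(rollout.var(ddof=1) / rollout.size + holdout.var(ddof=1) / holdout.size)
low = (rollout.mean() - holdout.mean() - 1.96 * se) / holdout.mean()
high = (rollout.mean() - holdout.mean() + 1.96 * se) / holdout.mean()

print(f"sustained lift vs. holdout: {lift:+.1%} (95% CI roughly {low:+.1%} to {high:+.1%})")
# If this sits far below the +15% seen in the test, the experiment likely
# overstated the long-term effect (novelty, mix shift, time-to-KPI, ...).
```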
-
Day 6 - CRO series: Strategy development ➡ A/B Testing (Part 2)

Running an A/B test is just the first step. Understanding the results is where the real value lies. Here's how to interpret them effectively:

1. Check for Statistical Significance
Not all differences are meaningful. Look at the p-value (the probability of seeing a result at least this extreme by chance if there were no real difference):
◾ p < 0.05 → Statistically significant
◾ p < 0.01 → Strong statistical significance
If the result isn't statistically significant, it's not reliable enough to act on.

2. Use Confidence Intervals
A confidence interval tells you the range in which the true effect likely falls.
◾ Wide interval → Less certainty
◾ Narrow interval → More precise estimate
Tighter confidence intervals indicate a clearer difference between variations.

3. Consider Business Context
Numbers don't exist in isolation. For example:
◾ Click-through rate increases, but conversions don't? There may be an issue further down the funnel.
◾ More sign-ups but lower retention? You might be attracting the wrong audience.
Always tie insights back to business goals.

4. Monitor Guardrail Metrics
A test should improve performance without creating new issues.
◾ Higher click-through rates but also higher bounce rates? Something's off.
◾ Increased conversions but lower customer satisfaction? A long-term risk.
Look beyond the primary metric to avoid unintended consequences.

Why A/B Testing Matters
✔ Increases Engagement – Find what resonates with your audience
✔ Improves Conversions – Optimize key elements for better performance
✔ Enables Data-Driven Decisions – Move beyond assumptions
✔ Encourages Continuous Improvement – Always refine and optimize

See you tomorrow!
P.S.: If you have any questions related to CRO and want to discuss your CRO growth or strategy, book a consultation call (absolutely free) with me (link in bio).
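For point 4, here is a small sketch of what checking a guardrail metric alongside the primary metric can look like (hypothetical counts, assuming standard two-proportion z-tests via statsmodels):

```python
# Hypothetical example: variant B lifts click-through rate (primary metric)
# but also lifts bounce rate (guardrail metric, where lower is better).
from statsmodels.stats.proportion import proportions_ztest

def two_sided_p(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test p-value for a difference between two proportions."""
    _, p = proportions_ztest([successes_a, successes_b], [n_a, n_b])
    return p

p_primary   = two_sided_p(520, 5000, 600, 5000)     # clicks:  10.4% vs. 12.0%
p_guardrail = two_sided_p(1500, 5000, 1650, 5000)   # bounces: 30.0% vs. 33.0%

print(f"primary (CTR) p-value:      {p_primary:.4f}")
print(f"guardrail (bounce) p-value: {p_guardrail:.4f}")
# Ship only if the primary metric improves AND the guardrail metric does not
# significantly degrade; otherwise investigate before rolling out.
```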
-
founder learnings! part 8. A/B test math interpretation - I love stuff like this:

Two members of our team (Fletcher Ehlers and Marie-Louise Brunet) ran a test recently that decreased click-through rate (CTR) by over 10%: they added a warning telling users they'd need to log in if they clicked. However, instead of hurting conversions like you'd think, it actually increased them. Fewer users clicked through, but overall, more users ended up finishing the flow.

Why? Selection bias and signal vs. noise. By adding friction, we filtered out low-intent users, the ones who would have clicked but bounced at the next step. The users who still clicked knew what they were getting into, making them far more likely to convert. Fewer clicks, but higher-quality clicks.

Here's a visual representation of the A/B test results. You can see how the click-through rate (CTR) dropped after adding friction (fewer clicks), but the total number of conversions increased. This highlights the power of understanding selection bias: removing low-intent users improved the quality of clicks, leading to better overall results.
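The arithmetic behind "fewer clicks, more conversions" is easy to sanity-check with made-up numbers (not the team's actual figures):

```python
# Hypothetical funnel: adding friction lowers CTR but raises the finish rate
# among those who still click, so total conversions go up.
visitors = 10_000

# Control: no login warning. More clicks, but many low-intent users bounce later.
ctr_control, finish_control = 0.40, 0.15
conversions_control = visitors * ctr_control * finish_control    # 600

# Variant: login warning. Over 10% fewer clicks, but higher-intent clickers.
ctr_variant, finish_variant = 0.35, 0.22
conversions_variant = visitors * ctr_variant * finish_variant    # 770

print(f"control: {conversions_control:.0f} conversions, variant: {conversions_variant:.0f} conversions")
```

The drop in clicks is more than offset by the jump in finish rate, which is the selection-bias effect described above.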