Statistics Without a Calculator
A working knowledge of statistics belongs in the skills toolbox of any data scientist. However, most of the methods require a computer, or at least a scientific calculator, to compute. Or do they? Here is one broadly applicable "trick" I use to estimate, entirely in my head, whether the results of an A/B test are significant.
The setup
Often, when running an A/B test, we are comparing the number of successful outcomes between two groups. These could be ad clicks, conversions, signups, downloads, etc. The data are typically presented something like this:
where "attempts" is the number of users shown a treatment and "success" is the count of our desired outcome. Treatment B was only shown about 10% as often as treatment A and shows a lift of almost 9%. However, the number of "success" for treatment B is small - only 275 events. So the question becomes: is treatment B actually better or could this result have occurred by chance?
Edit: I realized there is a much bigger application - year-over-year statistics. For example: "Incidents of crime spiked 30% from 10 last year to 13 this year." You can use this trick the same way: with only 10 incidents, the noise estimate is sqrt(10)/10, about 32%, so a 30% "spike" is the same size as the noise.
The trick
The statistical test to determine if treatment B is better than A is relatively straightforward. You can, for example, use a test based on the binomial distribution, but that math is not easy to do in your head. Notice that of all the raw counts, the 275 "successes" in treatment B is by far the smallest. This means that most of our uncertainty comes from this count. Under some reasonable assumptions, the uncertainty in this value is about equal to its square root. Put another way, if we were to rerun this experiment many times, the standard deviation of the "275" value would be about sqrt(275). This means our relative error is sqrt(275)/275, or about 6%.
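In code, the head math looks like this (note that sqrt(N)/N is just 1/sqrt(N)):

```python
import math

success_b = 275
abs_noise = math.sqrt(success_b)   # ~16.6 events of wiggle room
rel_noise = abs_noise / success_b  # equals 1/sqrt(275), ~6.0%
print(f"relative uncertainty: {rel_noise:.1%}")
```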
What does this mean?
It means that our uncertainty is on the order of 6%. Can we therefore conclude that treatment B's observed lift of 8.66% is real? Probably not. To be reasonably sure that the effect is real, the magnitude of the effect should be at least 2x the magnitude of the uncertainty. The conclusion should probably be: "gather more data to decrease the uncertainty" or perhaps "we are pretty sure the actual lift is less than 12%."
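As a sanity check on the rule of thumb, here is a standard pooled two-proportion z-test written out by hand, using the hypothetical counts from the earlier sketch:

```python
import math

# Hypothetical counts from the earlier sketch.
success_a, attempts_a = 2_531, 100_000
success_b, attempts_b = 275, 10_000

# Pooled two-proportion z-test: is B's rate really higher than A's?
p_pool = (success_a + success_b) / (attempts_a + attempts_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / attempts_a + 1 / attempts_b))
z = (success_b / attempts_b - success_a / attempts_a) / se
print(f"z = {z:.2f}")  # ~1.3
```

A z-score of about 1.3 is comfortably under 2, which matches the head estimate: the observed lift is not even two standard deviations away from zero.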
Here's a quick table of noise estimates, sqrt(N)/N:
| N      | noise estimate |
|--------|----------------|
| 10     | 31%            |
| 30     | 18%            |
| 100    | 10%            |
| 300    | 6%             |
| 1,000  | 3.1%           |
| 3,000  | 1.8%           |
| 10,000 | 1%             |
So if anyone claims that, based on 1,000 observations, they believe there is a 2.5% change in behavior, you can instantly know that this could just as well be noise.
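The table is just 1/sqrt(N) at round numbers; a few lines reproduce it and check the 1,000-observation claim:

```python
import math

for n in (10, 30, 100, 300, 1_000, 3_000, 10_000):
    print(f"{n:>6,}  {1 / math.sqrt(n):>5.1%}")

# 1,000 observations -> ~3.2% noise, so a claimed 2.5% change is within noise.
```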
Summary
Here are the steps to estimate whether your results are significant:
- Find the biggest source of variability (i.e. the smallest raw count, N)
- Estimate your relative uncertainty as sqrt(N)/N
- Compare the size of the effect to the uncertainty. If the magnitude of the effect is > 2x the uncertainty, great! Your effect is probably not due to random chance. If the measured effect is smaller, it may be due to random chance and you should not conclude the effect is real. (A sketch of this check in code follows this list.)
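Here is the whole recipe as one small function (a sketch - the function name and the 2x threshold are this post's rule of thumb, not a standard API):

```python
import math

def effect_is_probably_real(smallest_count: int, effect: float) -> bool:
    """Back-of-the-envelope significance check.

    smallest_count: the smallest raw count involved (dominant noise source)
    effect: the relative effect size, e.g. 0.0866 for an 8.66% lift
    """
    noise = 1 / math.sqrt(smallest_count)  # same as sqrt(N)/N
    return abs(effect) > 2 * noise

print(effect_is_probably_real(275, 0.0866))     # False: gather more data
print(effect_is_probably_real(10_000, 0.0866))  # True: 8.66% >> 2 x 1%
```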
Notes for those still reading
- This trick assumes that we are dealing with counts - clicks, conversions, downloads, etc. It will not work if we are dealing with other kinds of measurements (for example, average page load time).
- This trick only works if you are using your raw counts to multiply and divide. In this case we are dividing "success" by "attempts" and then dividing B by A to get the lift, so we are OK. If we were to add or subtract, the uncertainty estimation is more complicated.
- We are also assuming the events are independent. If a single user can be responsible for many "success" counts, then the events probably aren't independent and the uncertainty estimate can be off by a lot.
- If there are two significant sources of uncertainty (i.e. two small counts) that are about the same size, the true uncertainty may be up to 40% higher (a factor of the square root of 2). This is because the uncertainties of two independent values add in quadrature (see the sketch after this list).
- You can also use this trick in reverse! If you want to be able to demonstrate an effect of 3% magnitude, the 2x rule says you need the noise below 1.5%, which the table puts somewhere between 3,000 and 10,000 "success" counts (about 4,500 - see the sketch after this list).
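Two quick sketches of those last two notes - combining independent uncertainties in quadrature, and inverting the 2x rule to get a required count (the 300-count example is illustrative):

```python
import math

# Quadrature: two comparable, independent noise sources (say ~300 counts each).
u1 = 1 / math.sqrt(300)
u2 = 1 / math.sqrt(300)
combined = math.sqrt(u1**2 + u2**2)      # ~8.2%, i.e. sqrt(2) x 5.8%
print(f"combined noise: {combined:.1%}")

# Reverse use: require effect > 2 x noise = 2 / sqrt(N)  =>  N > (2 / effect)**2.
effect = 0.03
n_needed = math.ceil((2 / effect) ** 2)  # ~4,445 "success" counts for a 3% effect
print(f"need at least {n_needed:,} successes")
```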