A Brief Introduction to A/B Testing
What is A/B testing?
A/B testing is a method of comparing two or more versions of a webpage or app to determine which one performs better. It is a statistical experiment in which different variants are randomly shown to users, and their effectiveness is measured against predefined metrics. It is very useful for understanding user engagement with, and satisfaction with, a new feature or product. Let's take the web UI design shown in the following picture as an example. We want to investigate the effect of the color of a button on the landing page on the click-through rate (CTR). We treat the original design with the red button as A (control) and the new design with the green button as B (treatment, or variation). We want to test whether the new design performs better in terms of the CTR metric.
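As a minimal sketch of how the random assignment might work in practice, the following Python snippet deterministically routes each user to one of the two variants. The user IDs and the 50/50 split are hypothetical choices for illustration.

```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Assign a user to variant A (control) or B (treatment).

    Hashing the user ID gives a stable 50/50 split: the same user
    always sees the same variant across visits.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Example: assign a few hypothetical users.
for uid in ["user_1", "user_2", "user_3"]:
    print(uid, "->", assign_variant(uid))
```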
What are null hypothesis and alternative hypothesis?
In A/B testing, we formulate two hypotheses: the null hypothesis and the alternative hypothesis. The alternative hypothesis is generally the claim you believe, and the null hypothesis is its contrary: a conjecture stating that the treatment has no impact on the metric. Let's use the web UI design example again. To test whether changing the color of the button from red to green results in a higher CTR, we formulate the alternative hypothesis that the page with the green button has a higher CTR, and the null hypothesis that the color change has no effect on the CTR. The next question is how to test our hypotheses using statistics. This leads us to the discussion of the p-value.
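In symbols, writing $p_A$ and $p_B$ for the click-through rates of the two designs (notation introduced here for illustration), the two hypotheses can be stated as:

$$
H_0: p_B = p_A \qquad \text{versus} \qquad H_1: p_B > p_A
$$

This is a one-sided test, since we specifically expect the green button to do better, not merely to differ.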
What does p-value mean?
The p-value is the probability of obtaining the observed results, or more extreme ones, under the assumption that the null hypothesis is true. A graphical representation of the p-value is shown in the following picture. The smaller the p-value, the stronger the evidence against the null hypothesis, since the likelihood of getting the observed results when the null hypothesis is true is very low. To determine whether a p-value is statistically significant, we compare it with a significance level. Typically, 0.05 is used as the significance level, though the choice depends on the specific use case. A p-value less than or equal to 0.05 indicates strong evidence against the null hypothesis, in which case we reject the null hypothesis in favor of the alternative.
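Formally, if $T$ is the test statistic and $t_{\mathrm{obs}}$ its observed value (symbols introduced here for illustration), the one-sided p-value is:

$$
\text{p-value} = \Pr(T \ge t_{\mathrm{obs}} \mid H_0)
$$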
Steps to run an A/B testing experiment
The following are typical steps to run an A/B testing experiment:
- Set up expectations and experiment requirements.
- Formulate alternative and null hypotheses.
- Define key metrics.
- Run the experiment and collect data.
- Perform data analysis and check whether the p-value is statistically significant (see the code sketch after this list).
- Refine and iterate the experiment.
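To make the analysis step concrete, here is a minimal sketch of a one-sided two-proportion z-test in Python. The pooled-variance z-test is one common choice for comparing two CTRs; the function name and the click and impression counts are placeholders.

```python
import math

from scipy.stats import norm

def ab_test_pvalue(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """One-sided two-proportion z-test: H1 says B's CTR exceeds A's."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pool the two groups to estimate the shared CTR under the null hypothesis.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return norm.sf(z)  # upper-tail probability of the standard normal

# Using the counts from the worked example below: 23/1000 vs 37/1000 clicks.
print(ab_test_pvalue(23, 1000, 37, 1000))  # about 0.033
```

Note that this pooled z-test treats both groups' rates as estimated, so its p-value (about 0.033) differs from the simplified exact-binomial calculation in the next section, which conditions on A's observed rate; both fall below 0.05 for these counts.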
Let's put everything together
In this section, we will use a concrete example and do the math to demonstrate how A/B testing works. Again, we use the web UI design example. We believe that the new design B will improve the CTR, so we propose the alternative hypothesis that B has a larger CTR, and the null hypothesis that it does not. Suppose designs A and B are randomly shown to 2000 users, with 1000 impressions each, and they receive 23 and 37 clicks, respectively. According to probability theory, we can model the number of clicks, X, with a binomial distribution B(n, p), where n is the number of trials and p is the probability of success in each trial. Assuming the null hypothesis is true, we estimate the click probability from the control group: p = 23/1000 = 0.023. Now let's calculate the probability of observing 37 or more clicks out of 1000 impressions under this assumption. The p-value is given by

$$
\text{p-value} = \Pr(X \ge 37) = \sum_{k=37}^{1000} \binom{1000}{k} (0.023)^k (1 - 0.023)^{1000-k}
$$
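As a quick sanity check, this short snippet (a sketch using SciPy's binomial survival function) evaluates the sum above; it should come out to roughly 0.004.

```python
from scipy.stats import binom

# P(X >= 37) for X ~ Binomial(n=1000, p=0.023); sf(36) = P(X > 36).
p_value = binom.sf(36, 1000, 0.023)
print(p_value)  # roughly 0.004
```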
We observe that the p-value is much smaller than our significance level of 0.05. Hence, we reject the null hypothesis and conclude that design B has a higher CTR than design A.