Hypothesis Testing: A/B Tests Explained
One important goal of statistical analysis is to find patterns in data and then apply these patterns in the ‘real world’. In fact, machine learning is often defined as the process of finding and applying patterns in large sets of data. With this ability to find and apply patterns, many processes and decisions in the world have become extremely data-driven. Think about it: when someone views or buys an item from Amazon, they often then see recommended products that Amazon suggests they might like.
Now, Amazon is not performing magic. Rather, it has built a recommendation system using information gathered from its users about which products they view, which products they like, and which products they purchase. Many factors can determine whether someone ‘might like’ a product and then purchase it, including previous searches, the frequency of the current search, user demographics and even the time of day. Are people more likely to click the purchase button if it is a calming colour such as blue rather than an aggressive colour such as red? Well, that can be found out by analyzing the patterns within data.
The trouble is that it is difficult to identify a genuine pattern when the data are subject to random noise, because random noise can produce patterns just by chance. Since this difficulty exists, analysts must use appropriate tools and models to make inferences from their data. One common and rigorous way of determining whether a pattern has occurred by chance is performing a hypothesis test.
Hypothesis testing is the use of statistics to determine whether observed data are consistent with a given hypothesis. In other words, data can be interpreted by assuming a specific outcome and then using statistical methods to reject, or fail to reject, that assumption. The usual process of hypothesis testing consists of four steps. First, hypotheses must be developed. The null hypothesis is the claim that is assumed to be true, commonly that the observations are the result of pure chance. The alternative hypothesis is the claim being tested against the null, commonly that the observations show a real effect combined with a component of chance variation.
Next, the test statistic must be decided. This is the quantity that will be used to assess the evidence against the null hypothesis. There are many test statistics which can be used, and the most appropriate one depends on the hypothesis test being carried out. Once the test statistic is calculated, one can then compute the p-value. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is true. For a right-tailed test, this is the probability to the right of the observed test statistic. The benefit of the p-value is that it can be tested at any desired level of significance, alpha, by comparing this probability directly with alpha; and this is the final step of hypothesis testing.
Alpha, the significance level, is the probability of rejecting the null hypothesis when it is actually true (a Type I error). With alpha at 5%, results are reported at the 95% level of confidence. When comparing the p-value to alpha, the null hypothesis is rejected once the p-value is less than or equal to alpha. In general, lower p-values provide stronger evidence. This is because a low p-value means that there is a smaller probability of witnessing an observation as extreme as the one being tested if the null hypothesis were true. Essentially, p-values gauge how consistent sample statistics are with a given null hypothesis. Therefore, if the p-value is small enough, it can be concluded that the sample is incompatible with the null hypothesis and the null hypothesis can be rejected.
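To make these four steps concrete, here is a small, self-contained sketch in Python. The scenario and all numbers are hypothetical: we test whether a coin that produced 60 heads in 100 flips is fair, using an exact binomial p-value.

```python
from math import comb

# Step 1 - hypotheses. Null: the coin is fair (p = 0.5).
# Alternative: the coin is biased towards heads (p > 0.5).
n, observed_heads, alpha = 100, 60, 0.05

# Steps 2 and 3 - test statistic and p-value. The statistic is simply
# the number of heads; the p-value is the probability of seeing at
# least 60 heads in 100 flips of a fair coin (the probability "to the
# right" of the observation under the null distribution).
p_value = sum(comb(n, k) for k in range(observed_heads, n + 1)) / 2**n
print(f"p-value = {p_value:.4f}")

# Step 4 - compare the p-value with alpha.
if p_value <= alpha:
    print("Reject the null hypothesis: the coin looks biased.")
else:
    print("Fail to reject the null: the result could be chance.")
```

Here the p-value works out to roughly 0.028, which is below 0.05, so at the 5% significance level the fair-coin hypothesis would be rejected.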
Now, back to the question of whether people are more likely to click the purchase button if it is blue rather than red. I still do not know, but scenarios like this are tested on large scales quite frequently in data-driven businesses. This is a form of hypothesis testing used to optimize a particular feature of a business. It is called A/B testing, and it refers to a way of comparing two versions of something to figure out which performs better. It involves showing two variants of the same product or feature to different segments of the business’s user-base at the same time and then determining which variant is more successful through the use of success and tracking metrics.
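A hypothetical version of the blue-versus-red button experiment could be analyzed with a two-proportion z-test; the click counts below are invented purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical A/B results: clicks on a blue vs a red purchase button.
n_blue, clicks_blue = 10_000, 620   # variant A
n_red, clicks_red = 10_000, 550     # variant B

rate_blue = clicks_blue / n_blue
rate_red = clicks_red / n_red

# Null hypothesis: both buttons share one underlying click-through
# rate, so pool the clicks to estimate it and compute the z statistic.
p_pool = (clicks_blue + clicks_red) / (n_blue + n_red)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_blue + 1 / n_red))
z = (rate_blue - rate_red) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these made-up numbers the p-value lands just under 0.05, so the blue button would edge out the red one at the 5% level; with a real experiment, the sample sizes and counts would of course come from logged user data.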
A/B testing is often associated with websites and apps, and it is extremely common on large social media platforms. This is because the platform’s conversion rate (how many people saw something and then clicked it) can largely determine the platform’s fate. Therefore, every piece of content that a platform’s user can see needs to be optimized to achieve its maximum potential.
The process of A/B testing mirrors the process of hypothesis testing explained above. It requires analysts to conduct some initial research to understand what is happening and to determine which feature needs to be tested. At this point, the analyst can also determine the success and tracking metrics, because they would have used these statistics to understand the trend in the observations. After this, the hypotheses are formulated; without them, the testing campaign will be directionless. Next, variants of the feature are randomly assigned to users. Results are then collected and analyzed, and the successful variant is deployed.
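The "randomly assigned to users" step is often implemented by hashing a user identifier, so that each user always sees the same variant without any per-user state being stored. A minimal sketch of this idea follows; the function name, experiment label and 50/50 split are assumptions for illustration, not any specific platform's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "news-page-test") -> str:
    """Deterministically assign a user to variant A or B.

    Hashing (experiment name + user id) gives a stable, roughly
    uniform split: the same user always lands in the same bucket.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100       # bucket in [0, 100)
    return "A" if bucket < 50 else "B"   # 50/50 split

# The same user id always maps to the same variant:
print(assign_variant("user-42"), assign_variant("user-42"))
```

Keying the hash on the experiment name as well as the user id means different experiments slice the user-base independently, so being in variant A of one test does not correlate with being in variant A of another.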
Lastly, let us examine a hypothetical A/B test. Consider a large social media platform which has both individual users who share content about their lives, as well as companies which share important information such as company updates or world news. It is seen that user engagement on company content is low, and this is an issue because the platform wants to ensure that its user-base is as up to date as possible with what is happening around the world. Logically, the goal is to develop a plan to increase user engagement on company content.
It could be reasonable to assume that engagement might be low because company content is buried in personal content, and users are not immediately aware that they are browsing through two different types of content. Because of this, engagement could increase if company content were separated from personal content and placed on a dedicated “news page”. This way, users will know for sure what type of content they are viewing, and they might spend more time understanding the world around them, thus increasing engagement. Therefore, the null hypothesis could be that the difference between the average engagement on the redesign and the average engagement on the original design is zero. The alternative hypothesis would then be that this difference is greater than zero.
A success metric for this test would be the number of users (from the testing sample) who visit this “news page”. The reason is that the redesign can only be successful if users visit and consume content on that page. A tracking metric could then be the watch-time per user, because it needs to be determined whether users are engaging with content once they reach the page, or whether they are landing on the page (by accident or otherwise) and immediately leaving.
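One simple way to analyze the hypothetical test is a permutation test on the watch-time metric: shuffle the variant labels many times and see how often a difference at least as large as the observed one arises by chance. The watch-time data below are simulated with made-up means, not real platform data.

```python
import random

random.seed(0)  # reproducible simulated data

# Simulated watch-time (minutes per user) for each variant.
original = [random.gauss(5.0, 2.0) for _ in range(200)]  # original design
redesign = [random.gauss(6.0, 2.0) for _ in range(200)]  # "news page" design

observed_diff = sum(redesign) / 200 - sum(original) / 200

# Permutation test: under the null hypothesis the variant labels are
# interchangeable, so repeatedly shuffle the pooled data and count how
# often a shuffled split produces a difference at least as large as
# the one actually observed.
pooled = original + redesign
exceed = 0
n_perms = 5000
for _ in range(n_perms):
    random.shuffle(pooled)
    diff = sum(pooled[200:]) / 200 - sum(pooled[:200]) / 200
    if diff >= observed_diff:
        exceed += 1

p_value = exceed / n_perms
print(f"observed difference = {observed_diff:.2f} min, p-value = {p_value:.4f}")
```

A nice property of the permutation approach is that it makes no assumption about the shape of the watch-time distribution, which in practice is usually heavily skewed rather than normal.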
If it is found that the engagement on the redesign is significantly higher and that it is not by chance, then the redesign should be implemented for the entire platform. It should be noted that the example is a simplified version of the A/B testing process, but the concepts can still be applied.
Welcome to the wonderful world of hypothesis testing!