Statistical Hypothesis Tests for Data Science
As a data scientist, I am often unsure which statistical test to use for a specific scenario, since there are many to choose from, such as Student's t-test, the chi-square test, and the ANOVA test. In this issue of Data Science Bytes, I will talk about commonly used statistical tests in data science and how to choose among them when conducting hypothesis testing.
Chi-Square Test
A chi-square test is a statistical hypothesis test that is applied when the test statistic is chi-squared distributed under the null hypothesis. It is generally used to test homogeneity or independence for categorical data.
Example 1 (test of homogeneity): Take an A/B testing traffic splitter as an example. The traffic splitter randomly assigns website visitors to 100 bins. After collecting enough data, a chi-square test can be used to check whether visitors are uniformly distributed across the 100 bins. (Strictly speaking, comparing observed counts against a uniform expectation is a chi-square goodness-of-fit test.)
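Here is a minimal sketch of Example 1 in Python using `scipy.stats.chisquare`. The visitor counts are simulated, and the number of visitors (50,000) and the random seed are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(seed=0)

# Simulated data (an assumption): 50,000 visitors, each assigned to one of 100 bins.
n_visitors, n_bins = 50_000, 100
assignments = rng.integers(low=0, high=n_bins, size=n_visitors)

# Observed number of visitors in each bin.
observed = np.bincount(assignments, minlength=n_bins)

# With f_exp omitted, chisquare tests against equal expected counts in every bin.
statistic, p_value = chisquare(observed)

print(f"chi-square = {statistic:.2f}, p-value = {p_value:.3f}")
# A small p-value (for example, below 0.05) would suggest the split is not uniform.
```

With real traffic, `observed` would simply be the per-bin visitor counts taken from the splitter's logs.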
Example 2 (test of independence): Suppose we want to examine two categorical variables: sex (male and female) and smoking habit (smoker and non-smoker). We can use a chi-square test to check whether these two variables are independent.
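A test of independence like Example 2 can be sketched with `scipy.stats.chi2_contingency`. The counts in the contingency table below are hypothetical and only serve to show the shape of the input.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are sex, columns are smoking habit.
#                  smoker  non-smoker
table = np.array([[60,     140],    # males
                  [45,     155]])   # females

# For a 2x2 table, SciPy applies Yates' continuity correction by default.
statistic, p_value, dof, expected = chi2_contingency(table)

print(f"chi-square = {statistic:.2f}, dof = {dof}, p-value = {p_value:.3f}")
print("expected counts under independence:")
print(expected)
# A small p-value would suggest that sex and smoking habit are not independent.
```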
Student’s T-test and ANOVA Test
A Student's t-test is a statistical test in which the test statistic follows a Student's t-distribution under the null hypothesis. Depending on how many groups are considered, there are several types of tests:
One-sample t-test
In a one-sample test, only one group is considered. We generally use a one-sample t-test to check whether the population mean is equal to a specified value. For example, to test whether a coin is fair, we flip it 100 times, code heads as 1 and tails as 0, and check whether the proportion of heads is close to 0.5.
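The coin example can be sketched with `scipy.stats.ttest_1samp`. The flips are simulated with a fixed seed (an assumption), with heads coded as 1 and tails as 0.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=1)

# Simulated data (an assumption): 100 flips, 1 = heads, 0 = tails.
flips = rng.integers(low=0, high=2, size=100)

# Null hypothesis: the mean of the flips (the proportion of heads) is 0.5.
statistic, p_value = ttest_1samp(flips, popmean=0.5)

print(f"proportion of heads = {flips.mean():.2f}")
print(f"t = {statistic:.2f}, p-value = {p_value:.3f}")
# A small p-value would suggest the coin is not fair.
```

For 0/1 outcomes like coin flips, an exact binomial test is a common alternative; the t-test is used here only to mirror the example in the text.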
Two-sample unpaired t-test or independent two-sample t-test
In a two-sample unpaired t-test, we compare the means of two independent groups. For example, we can use a two-sample unpaired t-test to check whether there is a difference between the average weight of 50 women (group A) and that of 50 men (group B). A/B testing generally falls into this category.
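The weight comparison can be sketched with `scipy.stats.ttest_ind`. The weights are simulated, and the means and standard deviations are made-up values. Passing `equal_var=False` selects Welch's variant, which does not assume the two groups share a variance; that is a choice made for this sketch rather than something the example requires.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=2)

# Simulated weights in kg (assumed parameters, for illustration only).
women = rng.normal(loc=65, scale=10, size=50)  # group A
men = rng.normal(loc=78, scale=12, size=50)    # group B

# Null hypothesis: both groups have the same mean weight.
statistic, p_value = ttest_ind(women, men, equal_var=False)

print(f"mean A = {women.mean():.1f}, mean B = {men.mean():.1f}")
print(f"t = {statistic:.2f}, p-value = {p_value:.4f}")
```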
Two-sample paired t-test or dependent two-sample t-test
In a two-sample paired t-test, we compare two population means where the two sets of observations come from the same subjects, typically before and after an intervention, e.g., checking patients' cholesterol levels before and after medical treatment.
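A paired comparison can be sketched with `scipy.stats.ttest_rel`. The cholesterol values are simulated under assumed parameters. The sketch also shows that a paired t-test is equivalent to a one-sample t-test on the per-patient differences.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(seed=3)

# Simulated cholesterol levels in mg/dL for 30 patients (assumed parameters).
before = rng.normal(loc=240, scale=25, size=30)
after = before - rng.normal(loc=15, scale=10, size=30)  # same patients, after treatment

# Null hypothesis: the mean within-patient difference is zero.
statistic, p_value = ttest_rel(before, after)
print(f"t = {statistic:.2f}, p-value = {p_value:.4f}")

# Equivalent formulation: one-sample t-test on the differences against 0.
diff_result = ttest_1samp(before - after, popmean=0.0)
print(f"one-sample t on differences = {diff_result.statistic:.2f}")
```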
ANOVA test
If there are more than two groups, we need a one-way ANOVA test. A one-way ANOVA (analysis of variance) test compares the means of two or more independent groups using the F distribution to determine whether there is statistical evidence that the associated population means differ.
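A one-way ANOVA can be sketched with `scipy.stats.f_oneway`. The three groups below are simulated; think of them as a numerical metric measured under three hypothetical page layouts. The last lines illustrate that, with exactly two groups, the ANOVA F statistic equals the square of the equal-variance t statistic.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(seed=4)

# Simulated metric for three independent groups (assumed parameters).
group_a = rng.normal(loc=50, scale=8, size=40)
group_b = rng.normal(loc=53, scale=8, size=40)
group_c = rng.normal(loc=58, scale=8, size=40)

# Null hypothesis: all three groups share the same population mean.
statistic, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {statistic:.2f}, p-value = {p_value:.4f}")

# With only two groups, F equals t squared (t from the equal-variance t-test).
f_two = f_oneway(group_a, group_b).statistic
t_two = ttest_ind(group_a, group_b).statistic
print(f"F (two groups) = {f_two:.4f}, t^2 = {t_two ** 2:.4f}")
```

A significant result only says that at least one mean differs; it does not say which one.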
Pearson’s Correlation
Pearson's correlation is a correlation coefficient that is commonly used to measure the statistical relationship, or association, between two numerical variables. It is a number between -1 and +1 that indicates the direction and strength of the linear relationship between the two variables. As an example, we would expect the age and height of a sample of teenagers to have a Pearson correlation coefficient that is significantly greater than 0, which means that, generally, the older a teenager is, the taller he or she is.
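The age and height example can be sketched with `scipy.stats.pearsonr`, which returns the coefficient together with a p-value for the null hypothesis of no linear correlation. The data are simulated, and the linear relationship and noise level are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=5)

# Simulated teenagers (assumed relationship): height grows roughly linearly with age.
age = rng.uniform(low=13, high=19, size=200)                     # years
height = 120 + 3.0 * age + rng.normal(loc=0, scale=6, size=200)  # cm

r, p_value = pearsonr(age, height)

print(f"Pearson r = {r:.2f}, p-value = {p_value:.2e}")
# r > 0 indicates that height tends to increase with age;
# the closer |r| is to 1, the stronger the linear relationship.
```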
How to Choose a Statistical Test
The best way to choose which statistical test to use is to ask the following three questions:
- Are we trying to compare data or looking for a relationship?
- What type of data are we dealing with? Categorical or numerical or both?
- How many groups are considered and are they independent or not?
Based on the answers to these three questions, you can look up the table above and figure out which test to use. That is easy, right?!
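The three questions can also be sketched as a small lookup function. This is only an illustration covering the tests discussed in this issue. The function name `choose_test`, its parameters, and the returned labels are all hypothetical, and the mapping is a simplification rather than a complete decision procedure.

```python
def choose_test(goal, data_type, n_groups=1, paired=False):
    """Suggest a test from the ones covered in this article (simplified sketch).

    goal:      "compare" or "relationship"
    data_type: "categorical" or "numerical" (type of the measured variable)
    n_groups:  number of groups being compared (numerical data only)
    paired:    True if the groups are dependent (same subjects measured twice)
    """
    if data_type == "categorical":
        # Homogeneity and independence for categorical data.
        return "chi-square test"
    if data_type != "numerical":
        raise ValueError(f"unsupported data type: {data_type!r}")
    if goal == "relationship":
        return "Pearson's correlation"
    if goal != "compare":
        raise ValueError(f"unsupported goal: {goal!r}")
    if n_groups == 1:
        return "one-sample t-test"
    if n_groups == 2:
        return "two-sample paired t-test" if paired else "two-sample unpaired t-test"
    if paired:
        # More than two dependent groups is outside the scope of this article.
        raise ValueError("more than two dependent groups is not covered here")
    return "one-way ANOVA"


# Example: an A/B test comparing a numerical metric across two independent groups.
print(choose_test("compare", "numerical", n_groups=2))
```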