Statistical Hypothesis Tests for Data Science
As a data scientist, I am often unsure which statistical test to use for a specific scenario, because there are so many available, such as Student's t-test, the chi-square test, and ANOVA. In this issue of Data Science Bytes, I will talk about commonly used statistical tests in data science and how to choose among them when conducting hypothesis testing.

Chi-Square Test

A chi-square test is a statistical hypothesis test whose test statistic follows a chi-squared distribution under the null hypothesis. It is generally used to test goodness of fit, homogeneity, or independence for categorical data.

Example 1 (goodness-of-fit test): Take an A/B testing traffic splitter as an example. The traffic splitter randomly assigns website visitors to 100 bins. After collecting enough data, a chi-square test can be used to check whether visitors are uniformly distributed across the 100 bins.
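As a rough illustration of Example 1, the sketch below computes the chi-square statistic, the sum of (observed - expected)^2 / expected over all bins, for a simulated splitter. The visitor count, the random seed, and the helper name chi_square_gof are assumptions made for this example. To obtain a p-value, the statistic would be compared with a chi-squared distribution with (number of bins - 1) degrees of freedom, for instance via scipy.stats.chisquare.

```python
import random

def chi_square_gof(observed, expected):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1

# Simulated splitter (assumption: 100,000 visitors, each sent to one of 100 bins).
rng = random.Random(0)
n_visitors, n_bins = 100_000, 100
counts = [0] * n_bins
for _ in range(n_visitors):
    counts[rng.randrange(n_bins)] += 1

# Under the null hypothesis every bin expects the same number of visitors.
expected = [n_visitors / n_bins] * n_bins
stat, dof = chi_square_gof(counts, expected)
print(f"chi-square = {stat:.1f}, degrees of freedom = {dof}")
```

A statistic far above the degrees of freedom would suggest that the splitter is not distributing visitors uniformly.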

Example 2 (test of independence): Suppose we want to examine two categorical variables: sex (male and female) and smoking habit (smoker and non-smoker). We can use a chi-square test to check whether these two variables are independent.
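For Example 2, a minimal sketch of the independence test on a 2 x 2 table might look as follows. The counts are made up, and the helper name chi_square_independence is hypothetical. The expected count of each cell is (row total x column total) / grand total. Note that this sketch applies no continuity correction, whereas scipy.stats.chi2_contingency applies Yates' correction by default for 2 x 2 tables, so the two can give different statistics.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Made-up counts: rows are male/female, columns are smoker/non-smoker.
table = [[30, 70],
         [20, 80]]
stat, dof = chi_square_independence(table)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

With 1 degree of freedom, the 5% critical value is about 3.84, so a statistic below that would not give evidence against independence at that level.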

Student’s T-test and ANOVA Test

A Student's t-test is a statistical test in which the test statistic follows a Student's t-distribution under the null hypothesis. Depending on how many groups are considered, there are different types of tests:

One-sample t-test

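In a one-sample test, only one group is considered. We generally use a one-sample t-test to check whether the population mean is equal to a specified value. For example, to test whether a coin is fair, we flip it 100 times and check whether the proportion of heads is close to 0.5. (For a binary outcome like this, a binomial test is the more exact choice; with 100 flips the t-test is a reasonable approximation.)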
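The coin example can be sketched as below, coding heads as 1 and tails as 0. The t statistic is (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)). The simulated flips, the seed, and the helper name one_sample_t are assumptions for illustration; scipy.stats.ttest_1samp would return the same statistic together with a p-value.

```python
import math
import random
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    return t_stat, n - 1

# Simulated flips of a fair coin (assumption): 1 = heads, 0 = tails.
rng = random.Random(42)
flips = [1 if rng.random() < 0.5 else 0 for _ in range(100)]

t_stat, dof = one_sample_t(flips, mu0=0.5)
print(f"proportion of heads = {statistics.fmean(flips):.2f}")
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```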

Two-sample unpaired t-test or independent two-sample t-test

In a two-sample unpaired t-test, we compare the means of two independent groups. For example, we can use a two-sample unpaired t-test to check whether there is a difference in average weight between 50 women (group A) and 50 men (group B). A/B testing generally falls into this category.
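A minimal sketch of the weight comparison is shown below. It uses Welch's variant of the unpaired t-test, which does not assume equal variances in the two groups; the classic Student's version pools the variances instead. The simulated weights, the seed, and the helper name welch_t are assumptions. In SciPy, scipy.stats.ttest_ind performs the pooled test by default and Welch's test when equal_var=False is passed.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate (Welch-Satterthwaite) degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t_stat = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, dof

# Made-up weights in kg for 50 women (group A) and 50 men (group B).
rng = random.Random(1)
women = [rng.gauss(65, 10) for _ in range(50)]
men = [rng.gauss(78, 12) for _ in range(50)]

t_stat, dof = welch_t(women, men)
print(f"t = {t_stat:.3f}, approximate degrees of freedom = {dof:.1f}")
```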

Two-sample paired t-test or dependent two-sample t-test

In a two-sample paired t-test, we compare two population means where the two sets of observations come from the same subjects, typically before and after an intervention, e.g., checking patients' cholesterol levels before and after medical treatment.
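A paired t-test is equivalent to a one-sample t-test on the per-subject differences, tested against a mean of zero. The sketch below illustrates this with made-up cholesterol readings; the data and the helper name paired_t are assumptions, and scipy.stats.ttest_rel would additionally report a p-value.

```python
import math
import statistics

def paired_t(before, after):
    """Return the paired t statistic (after minus before) and its degrees of freedom."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t_stat = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1

# Made-up cholesterol readings (mg/dL) for the same 8 patients.
before = [220, 245, 210, 260, 230, 250, 240, 225]
after = [205, 230, 212, 240, 221, 235, 228, 220]

t_stat, dof = paired_t(before, after)
print(f"mean change = {statistics.fmean(a - b for b, a in zip(before, after)):.3f}")
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A negative t statistic here means the readings were lower on average after treatment.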

ANOVA test

If the number of groups is larger than two, we need a one-way ANOVA test. One-way ANOVA (analysis of variance) compares the means of two or more independent groups using the F distribution in order to determine whether there is statistical evidence that at least one of the associated population means differs from the others.
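The F statistic behind one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it for three small made-up groups; the data and the helper name one_way_anova_f are assumptions, and scipy.stats.f_oneway would also return the p-value.

```python
import statistics

def one_way_anova_f(*groups):
    """Return the F statistic with its between-group and within-group degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [statistics.fmean(g) for g in groups]
    grand_mean = statistics.fmean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up measurements for three independent groups.
group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [5, 6, 7]

f_stat, df_b, df_w = one_way_anova_f(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, degrees of freedom = ({df_b}, {df_w})")
```

A large F value indicates that the group means differ by more than the variation within groups would explain.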

Pearson’s Correlation

Pearson's correlation is a correlation coefficient that is commonly used to measure the statistical relationship, or association, between two numerical variables. It is a number between -1 and +1 that indicates in which direction and to what extent the two variables are linearly related. As an example, we would expect the age and height of a sample of teenagers to have a Pearson correlation coefficient that is significantly greater than 0, which means that, in general, the older a teenager is, the taller he or she tends to be.
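The coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand for a made-up sample of ages and heights; the data and the helper name pearson_r are assumptions, and scipy.stats.pearsonr would also return a p-value for the null hypothesis of zero correlation.

```python
import math

def pearson_r(x, y):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up ages (years) and heights (cm) for six teenagers.
ages = [13, 14, 15, 16, 17, 18]
heights = [155, 160, 166, 170, 173, 175]

r = pearson_r(ages, heights)
print(f"Pearson r = {r:.3f}")
```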

How to Choose a Statistical Test

The best way to choose a statistical test is to ask the following three questions:

  1. Are we trying to compare data or looking for a relationship?
  2. What type of data are we dealing with? Categorical or numerical or both?
  3. How many groups are considered, and are they independent or not?

Based on the answers to these three questions, you can look up the appropriate test in the table above. That is easy, right?!
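The three questions can also be turned into a small lookup. The sketch below covers only the tests described in this article and is one possible reading of them, not a complete decision guide; the function name choose_test and its arguments are hypothetical.

```python
def choose_test(goal, data_type, n_groups=None, paired=False):
    """Suggest a test from the three questions (goal, data type, groups).

    goal: "compare" or "relationship"
    data_type: "categorical" or "numerical" (type of the variables being tested)
    n_groups: number of groups when comparing numerical data
    paired: True if two groups hold observations from the same subjects
    """
    if goal == "relationship":
        if data_type == "numerical":
            return "Pearson's correlation"
        if data_type == "categorical":
            return "Chi-square test of independence"
    elif goal == "compare":
        if data_type == "categorical":
            return "Chi-square test"
        if data_type == "numerical" and n_groups is not None:
            if n_groups == 1:
                return "One-sample t-test"
            if n_groups == 2:
                return "Two-sample paired t-test" if paired else "Two-sample unpaired t-test"
            if n_groups > 2:
                return "One-way ANOVA"
    raise ValueError("combination not covered by this sketch")

# Example: cholesterol before and after treatment on the same patients.
print(choose_test("compare", "numerical", n_groups=2, paired=True))
# Example: average weight of three or more independent groups.
print(choose_test("compare", "numerical", n_groups=3))
```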
