Statistical Hypothesis Tests for Data Science

He Hao

Published Feb 21, 2021

As a data scientist, I am often unsure which statistical test to use in a given scenario, since there are many to choose from, such as Student’s t-test, the chi-square test, and ANOVA. In this issue of Data Science Bytes, I will cover the statistical tests most commonly used in data science and how to choose among them when conducting hypothesis testing.

Chi-Square Test

A chi-square test is a statistical hypothesis test in which the test statistic follows a chi-squared distribution under the null hypothesis. It is generally used to test goodness of fit, homogeneity, or independence for categorical data.

Example 1 (goodness-of-fit test): Let’s take an A/B testing traffic splitter as an example. The traffic splitter randomly assigns website visitors to 100 bins. After collecting enough data, a chi-square test can be used to check whether the visitors are uniformly distributed across the 100 bins, that is, whether the observed bin counts are consistent with equal expected counts.
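The uniformity check in Example 1 can be sketched as follows. This is a minimal illustration, assuming SciPy and NumPy are available; the visitor data is simulated, and the visitor count and random seed are assumptions made for the example rather than anything from a real splitter.

```python
import numpy as np
from scipy.stats import chisquare

# Simulated data (an assumption for illustration): 100,000 visitors,
# each assigned uniformly at random to one of 100 bins.
rng = np.random.default_rng(seed=0)
assignments = rng.integers(low=0, high=100, size=100_000)
observed = np.bincount(assignments, minlength=100)

# With no f_exp argument, chisquare compares the observed counts
# against equal expected counts in every bin.
result = chisquare(observed)
print(f"chi2 = {result.statistic:.2f}, p-value = {result.pvalue:.3f}")

# A deliberately broken splitter: bin 0 receives 5,000 extra visitors.
skewed = observed.copy()
skewed[0] += 5_000
skewed_result = chisquare(skewed)
print(f"skewed p-value = {skewed_result.pvalue:.3g}")
```

A small p-value would suggest that the bins are not filled uniformly, as in the skewed case here.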

Example 2 (test of independence): Suppose we want to examine two categorical variables: sex (male and female) and smoking habit (smoker and non-smoker). We can use a chi-square test to check whether these two variables are independent.
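A test of independence like Example 2 might look like the sketch below, assuming SciPy is available. The counts in the contingency table are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are sex, columns are smoking habit.
#                 smoker  non-smoker
table = np.array([[60, 140],    # male
                  [30, 170]])   # female

# For a 2x2 table, chi2_contingency applies Yates' continuity
# correction by default (correction=True).
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
print("expected counts under independence:")
print(expected)
```

The `expected` array holds the counts we would expect if sex and smoking habit were independent; a small p-value suggests they are not.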

Student’s T-test and ANOVA Test

A Student’s t-test is a statistical test in which the test statistic follows a Student’s t-distribution under the null hypothesis. Depending on how many groups are considered, there are several types of t-test:

One-sample t-test

In a one-sample t-test, only one group is considered. We generally use a one-sample t-test to check whether the population mean is equal to a specified value. For example, to test whether a coin is fair, we flip it 100 times, record each head as 1 and each tail as 0, and check whether the proportion of heads (the sample mean) differs significantly from 0.5.
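The coin example can be sketched as below, assuming SciPy is available. The outcome of 56 heads in 100 flips is a made-up figure for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical outcome of 100 flips: 56 heads (1) and 44 tails (0).
flips = np.array([1] * 56 + [0] * 44)

# Null hypothesis: the population mean (probability of heads) is 0.5.
result = ttest_1samp(flips, popmean=0.5)
print(f"mean = {flips.mean():.2f}, t = {result.statistic:.3f}, "
      f"p-value = {result.pvalue:.3f}")
```

With this data the p-value is well above 0.05, so we would not reject the hypothesis that the coin is fair.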

Two-sample unpaired t-test or independent two-sample t-test

In a two-sample unpaired t-test, we compare the means of two independent groups. For example, we can use a two-sample unpaired t-test to check whether there is a difference between the average weight of 50 women (group A) and that of 50 men (group B). A/B testing generally falls into this category.
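A minimal sketch of the weight comparison, assuming SciPy and NumPy are available. The weights are simulated; the means, standard deviations, and seed are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated weights in kg (assumed parameters, for illustration only).
rng = np.random.default_rng(seed=42)
women = rng.normal(loc=65.0, scale=10.0, size=50)   # group A
men = rng.normal(loc=80.0, scale=12.0, size=50)     # group B

# The default equal_var=True runs the classic Student's t-test;
# equal_var=False would run Welch's t-test, which does not assume
# equal variances in the two groups.
result = ttest_ind(women, men)
print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```

The sign of the t statistic follows the argument order: it is negative here because the first group has the smaller mean.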

Two-sample paired t-test or dependent two-sample t-test

In a two-sample paired t-test, we compare two population means where the two sets of observations come from the same subjects, typically before and after an intervention, e.g., checking patients’ cholesterol levels before and after medical treatment.
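The cholesterol example can be sketched as follows, assuming SciPy is available. The eight before-and-after readings are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Hypothetical cholesterol levels (mg/dL) for the same 8 patients.
before = np.array([220, 245, 210, 260, 235, 250, 228, 240])
after = np.array([205, 230, 212, 240, 225, 238, 220, 229])

# Paired test: each patient serves as their own control.
paired = ttest_rel(before, after)
print(f"t = {paired.statistic:.3f}, p-value = {paired.pvalue:.4f}")

# Equivalent view: a one-sample t-test on the per-patient differences.
diff_test = ttest_1samp(before - after, popmean=0.0)
print(f"t on differences = {diff_test.statistic:.3f}")
```

The second call shows why pairing matters: the paired test is a one-sample test on the differences, so variation between patients drops out of the comparison.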

ANOVA test

If there are more than two groups, we need a one-way ANOVA test. A one-way ANOVA (analysis of variance) test compares the means of two or more independent groups using the F-distribution to determine whether there is statistical evidence that at least one of the associated population means differs from the others.
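A one-way ANOVA over three groups might be run as in the sketch below, assuming SciPy is available. The three groups and their values are hypothetical.

```python
from scipy.stats import f_oneway

# Hypothetical measurements for three independent groups.
group_a = [23, 25, 21, 24, 22]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 23, 27]

# Null hypothesis: all three population means are equal.
result = f_oneway(group_a, group_b, group_c)
print(f"F = {result.statistic:.2f}, p-value = {result.pvalue:.3g}")
```

A small p-value says only that at least one group mean differs; it does not say which one.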

Pearson’s Correlation

Pearson’s correlation is a correlation coefficient that is commonly used to measure the statistical relationship, or association, between two numerical variables. It is a number between -1 and +1 that indicates the direction and strength of the linear relationship between the two variables. As an example, we would expect the age and height of a sample of teenagers to have a Pearson correlation coefficient that is significantly greater than 0, which means that, generally, the older a teenager is, the taller he or she is.
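The age and height example can be sketched as below, assuming SciPy is available. The ten data points are made up for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical sample of 10 teenagers.
ages = [13, 14, 15, 16, 17, 18, 13, 15, 17, 19]                # years
heights = [155, 160, 166, 170, 174, 176, 152, 163, 172, 178]   # cm

# pearsonr returns the correlation coefficient and a p-value for the
# null hypothesis that the two variables are uncorrelated.
r, p_value = pearsonr(ages, heights)
print(f"r = {r:.3f}, p-value = {p_value:.3g}")
```

A coefficient close to +1, as here, indicates a strong positive linear relationship.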

How to Choose a Statistical Test

The best way to choose a statistical test is to ask the following three questions:

Are we trying to compare data or looking for a relationship?
What type of data are we dealing with? Categorical or numerical or both?
How many groups are considered and are they independent or not?

Based on the answers to these three questions, you can look up the appropriate test in the table above. That is easy, right?!
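As a rough sketch, the three questions can also be written as a small lookup function. The function `choose_test` and its parameter names are hypothetical, and it covers only the tests discussed in this article; it is not a reproduction of the table.

```python
def choose_test(goal, data_type, n_groups=2, paired=False):
    """Suggest a test from the three questions (hypothetical helper).

    goal:      "compare" or "relationship"
    data_type: "categorical" or "numerical"
    n_groups:  number of groups being compared
    paired:    True if the groups are dependent (same subjects)
    """
    if goal == "relationship":
        if data_type == "categorical":
            return "chi-square test of independence"
        return "Pearson's correlation"
    # From here on, the goal is to compare data.
    if data_type == "categorical":
        return "chi-square test"
    if n_groups == 1:
        return "one-sample t-test"
    if n_groups == 2:
        if paired:
            return "two-sample paired t-test"
        return "two-sample unpaired t-test"
    return "one-way ANOVA"


print(choose_test("compare", "numerical", n_groups=2, paired=True))
print(choose_test("relationship", "numerical"))
```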
