Correlation vs. Causation
In data analysis, we often observe that two variables are related: one tends to change when the other changes. This relationship might lead us to assume that a change in one variable causes the change in the other. For example, whenever we see people on the street carrying umbrellas, it soon starts raining. Should we conclude that umbrellas cause the rain? No, correlation does not (always) imply causation. This issue of Data Science Bytes clarifies the difference between correlation and causation and explains why A/B testing is important for making causal claims.
Correlation
Correlation is a statistical measure of the relationship between two numerical variables, regardless of whether that relationship is causal. Pearson’s correlation coefficient is a number between -1 and +1 that measures the direction and strength of the linear relationship between two variables. The sign of the coefficient indicates whether the relationship is positive or negative, and its absolute value indicates how strong it is. A coefficient close to +1 or -1 indicates a strong linear relationship, and a coefficient close to 0 indicates a weak linear relationship or none at all.
For example, we would expect the age and height of a sample of teenagers to have a clearly positive correlation coefficient, which means that, in general, the older a teenager is, the taller he/she is.
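As a rough illustration of how the coefficient is computed, here is a minimal Python sketch that applies the standard Pearson formula to a small set of ages and heights. The numbers are made up for illustration and are not real measurements, and the helper name pearson_r is our own choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ages (years) and heights (cm), invented for this example.
ages = [13, 14, 15, 16, 17, 18]
heights = [155, 160, 166, 170, 173, 175]

r = pearson_r(ages, heights)
print(f"Pearson r between age and height: {r:.3f}")
```

With data like this, where height rises steadily with age, the coefficient comes out close to +1. Real samples are noisier, so the value would typically be lower.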
Causation
Causation refers to the relationship of cause and effect. It is the influence by which one event (a cause) contributes to the production of another event (an effect). Causation explicitly applies to cases where action A causes outcome B. Correlation, on the other hand, is simply a relationship: A is related to B, but one does not necessarily cause the other. Causation and correlation can exist at the same time. However, correlation does not imply causation, and causation does not imply (linear) correlation either. For example, if X ~ N(0,1) and Y = X^2, then Y is completely determined by X, yet their Pearson correlation is zero.
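The X ~ N(0,1), Y = X^2 example can be checked with a quick simulation. The sketch below draws X from a standard normal distribution, sets Y = X^2, and computes the sample Pearson coefficient. The sample size and random seed are arbitrary choices made for this illustration. In theory the covariance is E[X^3] - E[X]E[X^2] = 0, so the sample value should land near zero even though Y depends entirely on X.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # arbitrary seed, only for reproducibility
n_samples = 100_000  # arbitrary sample size

xs = [random.gauss(0.0, 1.0) for _ in range(n_samples)]  # X ~ N(0, 1)
ys = [x ** 2 for x in xs]  # Y is fully determined by X

r = pearson_r(xs, ys)
print(f"Pearson r between X and X^2: {r:.4f}")
```

The relationship here is symmetric around zero: large negative and large positive values of X both produce large Y, so the linear trend cancels out. Pearson’s coefficient only captures linear association, which is why it misses this dependence.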
In the following picture, we see that the sales of ice cream and the cases of sunburn are strongly correlated, but one does not cause the other. Instead, the cause for both is the weather.
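The ice cream and sunburn example can be sketched as a small simulation. Here a hypothetical daily temperature drives both ice cream sales and sunburn cases, while neither affects the other. All coefficients, noise levels, and the temperature range are invented for illustration. The sketch compares the correlation over all days with the correlation among days that have nearly the same temperature.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # arbitrary seed, only for reproducibility
n_days = 5000

# The common cause: daily temperature in degrees Celsius (hypothetical range).
temperature = [random.uniform(10, 35) for _ in range(n_days)]

# Both outcomes depend on temperature plus independent noise.
# Neither outcome appears in the other's formula.
ice_cream = [20 * t + random.gauss(0, 50) for t in temperature]
sunburn = [2 * t + random.gauss(0, 5) for t in temperature]

r_all = pearson_r(ice_cream, sunburn)

# Hold the weather roughly fixed: keep only days between 24 and 26 degrees.
similar = [
    (i, s)
    for t, i, s in zip(temperature, ice_cream, sunburn)
    if 24 <= t <= 26
]
r_fixed = pearson_r([i for i, _ in similar], [s for _, s in similar])

print(f"All days:            r = {r_all:.2f}")
print(f"Similar-weather days: r = {r_fixed:.2f} ({len(similar)} days)")
```

Across all days the two series are strongly correlated, but once the weather is held roughly constant the correlation largely disappears. That pattern is what we would expect when a third variable is the cause of both.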
Causation and A/B testing
When developing a new product or feature, we hope it will improve certain business metrics such as adoption, retention, and churn. However, how do we make such a causal claim, namely that a change in a metric is caused by the new feature or product? In theory, to establish that A caused B, we need to satisfy the following three conditions:
- Relationship. First of all, we need to observe a relationship between A and B, such as a strong correlation. Although correlation does not necessarily imply causation, we do need some kind of relationship to make a causal claim.
- Time order. The order of time should be right, i.e., to test if A caused B, we have to make sure A happened before B.
- Ruling out other explanations. More importantly, we need to make sure there is no other explanation for the relationship we observed between A and B.
The first and second conditions are easy to satisfy. However, how do we rule out all other explanations? The answer is A/B testing. In an A/B test, also known as a randomized controlled experiment, visitors are randomly assigned to control and treatment groups. Because of this randomization, the two groups are, on average, alike in every respect except the change being tested, which rules out alternative explanations for a difference in the metric.
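To make the idea concrete, here is a minimal sketch of a simulated A/B test in Python. Visitors are assigned to control or treatment by a coin flip, and the conversion rates are then compared with a pooled two-proportion z-test. The conversion rates (10% and 12%), the number of visitors, and the function name two_proportion_z_test are assumptions made for this illustration. A real experiment would also need a planned sample size and a pre-chosen significance level.

```python
import math
import random
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


random.seed(42)  # arbitrary seed, only for reproducibility
visitors = 50_000

# Hypothetical true conversion rates, unknown to the analyst in practice.
CONTROL_RATE = 0.10
TREATMENT_RATE = 0.12

n_control = n_treatment = 0
conv_control = conv_treatment = 0

for _ in range(visitors):
    # Random assignment: each visitor has a 50% chance of seeing the new feature.
    in_treatment = random.random() < 0.5
    rate = TREATMENT_RATE if in_treatment else CONTROL_RATE
    converted = random.random() < rate
    if in_treatment:
        n_treatment += 1
        conv_treatment += converted
    else:
        n_control += 1
        conv_control += converted

z, p_value = two_proportion_z_test(
    conv_control, n_control, conv_treatment, n_treatment
)
print(f"Control:   {conv_control}/{n_control} converted")
print(f"Treatment: {conv_treatment}/{n_treatment} converted")
print(f"z = {z:.2f}, p-value = {p_value:.4g}")
```

Because assignment depends only on the coin flip, nothing about a visitor (device, location, time of day) can systematically differ between the groups. A small p-value therefore points to the feature itself as the cause of the difference in conversion.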
To make scientific causal claims, let’s run A/B tests on our new features.