Evidence for causality without A/B testing: A panel-based approach

“Is this pure correlation, or is there a causal relationship between the variables?” As a data scientist in Silicon Valley, I hear this question a lot. And, believe me, I have no complaints: it clearly shows increasing awareness of the distinction between these fundamentally different types of relationship.

A/B testing is one of the most robust ways to establish causation. While one needs to be careful when reading A/B results, it is still considered the safest option. Combine this with the fact that a good portion of companies have the infrastructure needed to run hundreds of A/B tests at a time, and you have a clear winner!

Unfortunately, there is more than meets the eye. Certain types of relationships are too difficult, or even impossible, to test with A/B. For instance, the treatment that needs to be tested might be too expensive or risky to implement, so very strong evidence is needed before product development even begins. Similarly, the impact of interest might be medium- or long-term, and running a test for months is not an option. This is particularly true if we are interested in developing a proxy metric for long-term performance.

In these scenarios and many others like them, we turn to our good old snapshot analysis practice and control for as many variables as possible. In other words, we take a picture of our users’ behaviors and their features at a given point in time, and feed it to a statistical model to test our hypothesis. Since we controlled for several key variables, we tend to conclude, with a certain degree of confidence, that our analysis is an indicator of causation.

While we do not have a silver bullet that will work for everybody, we have started using panel data to improve the accuracy of our analyses and the corresponding insights. Panel data (also called longitudinal data) covers multiple entities (the cross-sectional dimension), each observed at several points in time (the time-series dimension). For example, think of a dataset collected by an offline retailer: it will include multiple customers, and multiple data points for customers who have made purchases on separate occasions.
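To make the retailer example concrete, here is a minimal sketch of such a dataset in plain Python. The customer IDs, dates, and amounts are made up for illustration; the point is only that the same customer appears more than once, with a time stamp on each observation.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, purchase_date, amount).
purchases = [
    ("A", "2024-01-05", 20.0),
    ("A", "2024-02-11", 35.5),
    ("B", "2024-01-20", 12.0),
    ("A", "2024-03-02", 18.0),
    ("B", "2024-03-15", 40.0),
    ("C", "2024-02-01", 9.5),
]

# Group the rows by customer: one entity, many time-stamped observations.
panel = defaultdict(list)
for customer_id, purchase_date, amount in purchases:
    panel[customer_id].append((purchase_date, amount))

# Order each customer's history by date (ISO dates sort correctly as strings).
for history in panel.values():
    history.sort()

# Customers observed more than once contribute within-customer variation.
repeat_customers = {c: h for c, h in panel.items() if len(h) > 1}
print(sorted(repeat_customers))
```

A snapshot would keep just one row per customer; the panel keeps the whole history.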

Why is panel data superior? Let’s think about an example where we are trying to understand the drivers of customer satisfaction. Our hypothesis is that users who are more engaged will be more satisfied. Why? Because, as they become more engaged, they will see the value of the product/service and hence will be more satisfied.

Let’s, for one second, assume that we have a crystal ball which can show us how our customers form their satisfaction perceptions. For the sake of simplicity, let’s assume that our crystal ball shows us the structure below: 

In simple terms, satisfaction is driven by engagement and motivation, and engagement is also partly driven by motivation. In other words, there is a hidden factor (motivation) which drives both satisfaction and engagement. Put into a mathematical formulation, it looks like this:

satisfaction  =  engagement + motivation

engagement = motivation + other factors

where we can observe satisfaction and engagement, but not motivation.
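One way to write this structure out more formally is sketched below. It rests on assumptions that are not stated above: the relationships are linear, and motivation is constant over time for each customer. The indices i (customer) and t (time period), the coefficients β, γ, δ, and the noise terms ε and u are illustrative names only.

```latex
\begin{aligned}
\text{satisfaction}_{it} &= \beta\,\text{engagement}_{it} + \gamma\,\text{motivation}_{i} + \varepsilon_{it} \\
\text{engagement}_{it}   &= \delta\,\text{motivation}_{i} + u_{it} \\[4pt]
% Subtract each customer's own time average (denoted by a bar):
\text{satisfaction}_{it} - \overline{\text{satisfaction}}_{i}
  &= \beta\,\bigl(\text{engagement}_{it} - \overline{\text{engagement}}_{i}\bigr)
   + \bigl(\varepsilon_{it} - \bar{\varepsilon}_{i}\bigr)
\end{aligned}
```

Under these assumptions the motivation term is the same in every period for a given customer, so it cancels when each customer's average is subtracted. That is why repeated observations can control for a factor we never observe.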

(State of the art) snapshot analysis will include a single satisfaction and engagement pair for each member, as well as other control variables. For instance, you will observe that customer A is extremely satisfied and took Y actions in the past Z days. In contrast, panel data analysis will include customers’ satisfaction and engagement at multiple points in time.

To make our point, we generated data using the model given above and estimated the relationship between engagement and satisfaction using both snapshot and panel data. As you can see in the graph below, snapshot analysis fits the data well but overestimates the magnitude of the relationship (by ~60% in this example). The main reason for the overestimation is that snapshot analysis attributes the impact of motivation to engagement. Panel data, on the other hand, helps us control for motivation even though we do not observe it directly, because repeated observations let us compare each customer against themselves over time, and thus results in an unbiased view.
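The comparison can be sketched with a small simulation. This is not the data or the estimator behind the graph. It is a minimal illustration under assumed settings: motivation is fixed per customer, all noise terms are Gaussian, the true engagement effect `beta` is 1.0, and the panel estimate uses a simple within-customer (fixed effects) demeaning. All function and parameter names are hypothetical.

```python
import random


def slope(x, y):
    """OLS slope of y on x for a single regressor with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n_customers=2000, n_periods=5, beta=1.0, seed=0):
    """Rows of (customer, period, engagement, satisfaction)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_customers):
        motivation = rng.gauss(0, 1)  # hidden; assumed constant per customer
        for t in range(n_periods):
            engagement = motivation + rng.gauss(0, 1)  # "other factors"
            satisfaction = beta * engagement + motivation + rng.gauss(0, 0.5)
            rows.append((i, t, engagement, satisfaction))
    return rows


def snapshot_estimate(rows, period=0):
    """One (engagement, satisfaction) pair per customer, pooled OLS."""
    snap = [r for r in rows if r[1] == period]
    return slope([r[2] for r in snap], [r[3] for r in snap])


def panel_estimate(rows):
    """Within estimator: demean each customer's data, then run OLS."""
    by_customer = {}
    for i, _, e, s in rows:
        by_customer.setdefault(i, []).append((e, s))
    x, y = [], []
    for obs in by_customer.values():
        mean_e = sum(e for e, _ in obs) / len(obs)
        mean_s = sum(s for _, s in obs) / len(obs)
        for e, s in obs:
            x.append(e - mean_e)  # motivation cancels in these differences
            y.append(s - mean_s)
    return slope(x, y)


rows = simulate()
print("true effect:       1.0")
print("snapshot estimate:", round(snapshot_estimate(rows), 3))
print("panel estimate:   ", round(panel_estimate(rows), 3))
```

With these assumed variances, the snapshot slope tends toward 1 + cov(engagement, motivation) / var(engagement) = 1 + 1/2 = 1.5 in large samples, while the within estimate stays close to the true value of 1.0. The size of the bias depends entirely on the chosen variances, so it will not match the ~60% figure above.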

Yes, this is a cooked-up example, and yes, in real life, showing causality without A/B testing is much more difficult. Having said that, we have seen on multiple occasions that snapshot analysis led to biased or wrong intuitions and business outcomes. Running (logistic) regression with control variables is not the answer to all of our problems. With today’s rich panel data, we can and should do much better.

Note: Greene’s Econometric Analysis is a good resource for those of us who are interested in analyzing panel data.



Enjoyed your article, Ercan. Thinking out loud, I believe that logistic regression tree predictive modeling could offer causality evidence by illustrating the variable hierarchy that maximizes predictive quality on the validation set. For example, in evaluating web page visit activity and probability of purchase, a split in hierarchy upon visiting a certain page would suggest a change in purchase outcomes (causality). What challenges do you see with this?

Good post! Your regression doesn't have much endogeneity since the slopes are nearly identical. ;) This assumes that you mostly care about the slope.

Great post on the use of panel data Ercan Yildiz. I remember Greene's text fondly :).
