Analyzing churn rates for Codeflix using SQL
1. Introduction
This project was part of a course on CodeCademy where I used SQL to explore churn rates in a fictional company. The churn rate is a percent of subscribers at the beginning of a period that cancel within that period.
2. Goals
Codeflix, a streaming video startup, is interested in measuring their user churn rate. In this project, I’ll be helping them answer these questions about their churn:
1. Get familiar with the company.
- How many months has the company been operating?
- Which months do you have enough information to calculate a churn rate?
- What segments of users exist?
2. What is the overall churn trend since the company started?
3. Compare the churn rates between user segments.
- Which segment of users should the company focus on expanding?
3. Context and data
Four months into launching Codeflix, management asks me to look into subscription churn rates. It’s early on in the business and people are excited to know how the company is doing.
The marketing department is particularly interested in how the churn compares between two segments of users. They provide me with a dataset containing subscription data for users who were acquired through two distinct channels.
The dataset provided to me contains one SQL table, subscriptions. Within the table, there are 4 columns:
id - the subscription id
subscription_start - the start date of the subscription
subscription_end - the end date of the subscription
segment - this identifies which segment the subscription owner belongs to
Codeflix requires a minimum subscription length of 31 days, so a user can never start and end their subscription in the same month.
4. Getting familiar with the data
First, I take a glimpse at the subscriptions table.
Now, I am more familiar with the data and know, for example, that the two existing segments are 87 and 30.
To determine the range of months of data provided and which months I will be able to calculate churn for, I performed the following query.
Recommended by LinkedIn
There is data available for four months: December 2016, January 2017, February 2017, and March 2017.
5. Calculating churn rate for each segment
I will calculate the churn rate for both segments (87 and 30) over the first three months of 2017 (I can’t calculate it for December since there are no subscription_end values yet). To do this, I will create different temporary tables using WITH.
5.1. Creating a temporary table for months.
5.2. Creating a temporary table from subscriptions and months tables.
5.3. Creating a temporary table for users’ status.
In this temporary table, I will identify if a user was active before the beginning of each month and if that user canceled the subscription. I will discriminate the analysis by segments and use CASE WHEN assigning 1 for actual cases and 0 otherwise.
5.4. Creating a temporary table with aggregate values.
Finally, I will create a temporary table that shows the SUM of the active and canceled subscriptions for each segment, for each month.
Notice that I manually coded the segment numbers in the previous temporary table, status, and this one. The process could be more complex if there were more segments. To avoid hard-coding in such cases, I could evaluate active and canceled users in general in the status table and then add a group by segments in the status_aggregate.
5.6. Calculating churn rates.
Now I can calculate the churn rates for the three months in general and by segments.
6. Analysis and recommendations
Using SQL I discovered that there is a clear upward trend in the churn rate from January to March. This indicated that a growing proportion of customers are leaving each month. Understanding why more customers are leaving overtime is crucial.
Something that stands out from the churn rates is that for March (churn_rate at approximately 27% and churn_rate_87 at approximately 49%), the churn rates are relatively high, which could be alarming. It's important to investigate what happened in that month that caused such a significant increase in churn.
If this pattern repeats every year, it could be a seasonal trend. However, since this is a new company, we don’t have enough data to determine if this is a seasonal trend. Another possibility is that there might have been an event or issue that caused a spike in churn.
Another important observation is that the churn rates for segment 87 are significantly higher than the churn rates for segment 30. The company needs to identify what they are doing differently for each segment to see what drives the higher churn rates in segment 87.
To conclude this analysis, I will suggest to the company the following points:
7. Recapitulation
During this project, I demonstrated my skills and performance using SQL. Some commands used in the project were: SELECT, FROM, LIMIT, GROUP BY, CROSS JOIN, CASE WHEN, WITH, AND, OR, IS NULL.