Churn Prediction using Apache Spark ML & Databricks
Companies typically spend most of their time on customer acquisition rather than on customer retention. Yet acquiring a new customer can cost up to five times more than retaining an existing one. Predictive analytics and machine learning can help identify which customers are most likely to churn, and why, with the aim of achieving a 360-degree view of the customer.
We will build a customer churn prediction model on telecom data using PySpark. Telecom companies often have customer service branches dedicated to winning back defecting clients.
Data Overview
The Telecom Customer Churn dataset contains 21 features for 7,043 customers and is available on Kaggle. To analyze the data and implement the churn model we used PySpark on Databricks (Spark 2.3.1).
Calling .cache() keeps the DataFrame in memory, which speeds up subsequent actions (although for 7,043 observations it is not strictly necessary).
Data wrangling
We first prepare the dataset by handling missing values and categorizing numeric features where necessary.
Spark SQL is available in Databricks thus you can simply copy-paste your SQL code.
Exploratory Analysis
Databricks makes plotting much easier: for example, we simply group by the variable "Churn" to get an overview of the number of churners and non-churners:
We can also combine several pieces of information to create more complex graphics:
As we can see, the distribution of "Monthly Charges" differs between churners and non-churners.
For summary statistics and correlation analysis, we convert the data to a pandas DataFrame (using toPandas()) and analyze the results.
Model Building
Churn prediction is a straightforward classification problem, so we will use methods such as Logistic Regression, Decision Tree and Random Forest from Spark ML.
Before training the classification models we create Pipelines, as in scikit-learn, which consist of different stages (Transformers and Estimators) and are really useful when working with big data tables.
Pipelines
We encode the categorical variables and assemble them with the numerical variables.
We finally create the features column that Spark ML estimators require:
The model is trained by making associations between the input features and the labeled output associated with these features.
Before fitting the models we split the dataset into two parts: 70% to train the models and 30% to test them.
Decision Tree
Each model is evaluated with the AUC (Area Under the Curve) of the Receiver Operating Characteristic (ROC) curve on the test set. We first run the Decision Tree model:
The AUC obtained is 0.823.
Logistic Regression
The AUC obtained is 0.85. (*) Note that logistic regression rests on strong assumptions (e.g. no multicollinearity among the features).
Random Forest
The AUC obtained is 0.853.
Features Importance
We see that Total Charges and Technical Support have a strong impact on customer churn.
Finally, the results are:

- Decision Tree: AUC = 0.823
- Logistic Regression: AUC = 0.850
- Random Forest: AUC = 0.853

Random Forest achieves the best AUC by a small margin, closely followed by Logistic Regression, with the Decision Tree trailing behind.
The code is available on GitHub.