Churn Prediction using Apache Spark ML & Databricks
Companies typically spend most of their time on customer acquisition rather than on customer retention. Yet acquiring a new customer can cost up to five times more than retaining an existing one. Predictive analytics and machine learning can help identify which customers are most likely to churn, and why, with the aim of achieving a 360-degree view of the customer.
We will build a customer churn prediction model on telecom data using PySpark. Telecom companies often have customer service branches dedicated to winning back defecting clients.
Data Overview
The Telecom Customer Churn dataset contains 21 features for 7,043 customers and is available on Kaggle. To analyze the data and implement the churn model we used PySpark on Databricks (Spark 2.3.1).
Calling .cache() keeps the DataFrame in memory, which speeds up subsequent actions (although for 7,043 observations it is not strictly necessary).
Data wrangling
We first prepare the dataset by handling missing values and categorizing numeric features where necessary.
Spark SQL is available in Databricks thus you can simply copy-paste your SQL code.
Exploratory Analysis
Databricks makes plotting much easier: for example, we simply group by the variable "Churn" to get an overview of the number of churners and non-churners:
We can also combine several pieces of information to create more complex graphics:
As we can see, the distribution of "Monthly Charges" differs between churners and non-churners.
For summary statistics and correlation analysis, we convert the data to a pandas DataFrame (using toPandas()) and analyze the results.
Model Building
Churn prediction is a straightforward classification problem, so we will use methods such as Logistic Regression, Decision Tree and Random Forest from Spark ML.
Before training the classification models we create Pipelines, as in scikit-learn, which consist of different stages (Transformers and Estimators) and are really useful when working with big data tables.
Pipelines
We encode the categorical variables and assemble them with the numerical variables.
We finally create the features column that Spark ML estimators require:
The model is trained by making associations between the input features and the labeled output associated with these features.
Before fitting the models we split the dataset into two parts: 70% to train the models and 30% to test them.
Decision Tree
Each model is evaluated with the AUC (Area Under the Curve) of the Receiver Operating Characteristic (ROC) curve on the test set. We first run the Decision Tree model:
The AUC obtained is 0.823.
Logistic Regression
The AUC obtained is 0.85. (*) Note that logistic regression rests on strong assumptions (e.g. no multicollinearity among the features).
Random Forest
The AUC obtained is 0.853.
Features Importance
We see that Total Charges and Technical Support have a strong impact on customer churn.
Finally, the results are:

- Decision Tree: AUC = 0.823
- Logistic Regression: AUC = 0.850
- Random Forest: AUC = 0.853

Random Forest achieves the best AUC by a small margin, closely followed by Logistic Regression, with the Decision Tree trailing behind.
The code is available on GitHub.