Predicting Customer Churn with Classification Machine Learning Models
Forecasting telecom customer attrition using classification machine learning models.
Introduction
Classification is a supervised machine learning task that uses labeled data to train a model to assign categorical labels to new data points. The objective is to accurately assign previously unseen data points to their respective classes based on patterns learned from the labeled examples.
Various classification algorithms are available such as logistic regression, decision trees, random forests, Support Vector Machines (SVM), Naïve Bayes, and K-Nearest Neighbors (KNN). The selection of an appropriate algorithm depends on the characteristics of the data, the number of classes, and the desired performance metrics.
Other factors to take into account are the model's interpretability, the amount of training data available, the complexity of the problem, and the available computational resources.
Common performance metrics for classification tasks include accuracy, precision, recall, F1 score, and ROC AUC score. The choice of a performance metric depends on the problem and the desired trade-off between different types of errors.
It is also worth noting that the quality and representativeness of the labeled data used for training the model are critical for its performance and generalizability to new data. Data preprocessing techniques such as feature extraction, dimensionality reduction, and data balancing can also affect the performance of the model.
There are two types of classification tasks: binary, which has only two possible outcomes, and multi-class, which has more than two. Spam filtering, sentiment analysis, image recognition, and medical diagnosis are a few typical examples of classification tasks.
Data Understanding
In this scenario, a sample dataset from a fictitious telecom company is used. The objective is to analyze the dataset in order to estimate the probability of customers discontinuing their services with the organization, and to determine the significant factors leading to churn.
Characteristics of the aforementioned sample dataset are:
gender -- Whether the customer is a male or a female
SeniorCitizen -- Whether the customer is a senior citizen or not
Partner -- Whether the customer has a partner or not (Yes, No)
Dependents -- Whether the customer has dependents or not (Yes, No)
tenure -- Number of months the customer has stayed with the company
PhoneService -- Whether the customer has a phone service or not (Yes, No)
MultipleLines -- Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService -- Customer's internet service provider (DSL, Fiber optic, No)
OnlineSecurity -- Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup -- Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection -- Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport -- Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV -- Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies -- Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract -- The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling -- Whether the customer has paperless billing or not (Yes, No)
PaymentMethod -- The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges -- The amount charged to the customer monthly
TotalCharges -- The total amount charged to the customer
Churn -- Whether the customer churned or not (Yes, No)
After importing the required Python libraries,
# Data handling
import pandas as pd
import numpy as np
# Visualisation (Matplotlib, Plotly, Seaborn, etc.)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# EDA (pandas-profiling, etc.)
...
# Feature processing (scikit-learn preprocessing, etc.)
from sklearn import preprocessing
a series of steps was taken to explore the data, and it was found that:
Based on the statistical summary:
Average tenure is 32 months
# Work on a copy of the raw data and drop the identifier column, which has no predictive value
df = Telco_churn.copy()
df.drop(['customerID'], axis=1, inplace=True)
# split the data into features (X) and target variable (y)
X = df.drop('Churn', axis=1)
y = df['Churn']
Based on the information provided in the dataset, there are some attributes that could potentially be combined to create new features:
# Combine "OnlineSecurity" and "DeviceProtection" into a new feature called "SecurityServices
X['SecurityServices'] = X['OnlineSecurity'] + X['DeviceProtection']
# Replace values with new labels based on whether the customer has no, one, or both security services
# (customers without internet service concatenate to 'No internet serviceNo internet service', so map that value too)
X['SecurityServices'] = X['SecurityServices'].replace({'NoNo': 'NoneSecurityServices',
                                                       'YesNo': 'OnlyOnlineSecurity',
                                                       'NoYes': 'OnlyDeviceProtection',
                                                       'YesYes': 'BothSecurityServices',
                                                       'No internet serviceNo internet service': 'NoInternetService'})
# Drop the original "OnlineSecurity" and "DeviceProtection" features
X = X.drop(['OnlineSecurity', 'DeviceProtection'], axis=1)
X['SecurityServices'].head()
The above code combines the “OnlineSecurity” and “DeviceProtection” features into a new feature called “SecurityServices” and replaces the values with new labels based on whether the customer has no, one, or both security services (customers without internet service get their own label). It then drops the original “OnlineSecurity” and “DeviceProtection” features.
This is useful for combining related features into a single feature and creating new labels based on the combinations of the original feature values.
# Create a new feature called "StreamingServices" by combining "StreamingTV" and "StreamingMovies"
X['StreamingServices'] = X['StreamingTV'] + X['StreamingMovies']
# Replace values with new labels based on whether the customer has no, one, or both streaming services
# (again mapping the concatenated no-internet value to its own label)
X['StreamingServices'] = X['StreamingServices'].replace({'NoNo': 'NoneStreamingServices',
                                                         'YesNo': 'OnlyStreamingTV',
                                                         'NoYes': 'OnlyStreamingMovies',
                                                         'YesYes': 'BothStreamingServices',
                                                         'No internet serviceNo internet service': 'NoInternetService'})
# Drop the original "StreamingTV" and "StreamingMovies" features
X = X.drop(['StreamingTV', 'StreamingMovies'], axis=1)
X['StreamingServices'].head()
A new feature called “StreamingServices” has been created by concatenating the “StreamingTV” and “StreamingMovies” features. The code then replaces the values in the new feature based on whether the customer has no, one, or both streaming services, using a dictionary to map the old values to the new labels. Finally, it drops the original “StreamingTV” and “StreamingMovies” features from the dataset.
# Combine PhoneService and MultipleLines into a single feature, PhoneServices
# (customers without phone service are labelled separately rather than lumped in with 'SingleLine')
X['PhoneServices'] = X.apply(lambda x: 'NoPhoneService' if x['PhoneService'] == 'No'
                             else ('MultipleLines' if x['MultipleLines'] == 'Yes' else 'SingleLine'), axis=1)
# Drop PhoneService and MultipleLines columns
X.drop(['PhoneService', 'MultipleLines'], axis=1, inplace=True)
X['PhoneServices'].head(100)
This code combines the “PhoneService” and “MultipleLines” features into a new feature called “PhoneServices”, distinguishing customers with no phone service, a single line, or multiple lines, and drops the original “PhoneService” and “MultipleLines” features.
This is useful for combining related features into a single feature and simplifying the dataset by removing redundant features.
# Create a new feature called "InternetServices"
X['InternetServices'] = X.apply(lambda row: 'DSL Only' if row['InternetService'] == 'DSL' and row['OnlineBackup'] == 'No' else
'Fiber Optic Only' if row['InternetService'] == 'Fiber optic' and row['OnlineBackup'] == 'No' else
'Internet and Backup' if (row['InternetService'] == 'DSL' or row['InternetService'] == 'Fiber optic') and row['OnlineBackup'] == 'Yes' else
'No Internet Service', axis=1)
# Drop the original "InternetService" and "OnlineBackup" columns
X = X.drop(['InternetService', 'OnlineBackup'], axis=1)
X['InternetServices'].head()
A new feature called “InternetServices” is created by applying a lambda function to each row of the dataset. The lambda function checks the values of the “InternetService” and “OnlineBackup” features and assigns a label to the new feature based on those values. It then drops the original “InternetService” and “OnlineBackup” features from the dataset.
Feature Encoding
Machine learning algorithms require numerical inputs, so categorical features must be transformed before they can be used. There are several methods for encoding categorical features in Python; one-hot encoding will be used in this case.
One-Hot Encoding: This method creates a new binary column for each possible category in a categorical feature. Each column is assigned a value of 0 or 1, indicating whether or not the category is present in the original feature. This method is commonly used when there are a small number of categories and/or when the categories are not ordered.
for col in X.columns:
print(f"Column '{col}' categories: {X[col].unique()}")
This code helps us understand the distribution of categorical data in the DataFrame, identify any missing or unexpected categories, and select an appropriate encoding method based on the unique categories in each column.
# Import the necessary libraries
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = ['Partner', 'Dependents', 'PaperlessBilling', 'gender']
# Create a OneHotEncoder object
ohe = OneHotEncoder()
# Fit and transform the columns using the OneHotEncoder
encoded_columns = ohe.fit_transform(X[columns_to_encode])
# Create a new DataFrame with the encoded columns
encoded_df = pd.DataFrame(encoded_columns.toarray(), columns=ohe.get_feature_names_out(columns_to_encode))
# Drop the original columns from the original DataFrame
X.drop(columns_to_encode, axis=1, inplace=True)
# Concatenate the original DataFrame with the new encoded DataFrame
X = pd.concat([X, encoded_df], axis=1)
The code is performing one-hot encoding on the specified columns in the DataFrame ‘X’ using the scikit-learn OneHotEncoder. The end result is that the original categorical columns in ‘X’ have been replaced with a set of one-hot encoded columns, where each original category is represented by a binary indicator column.
columns_to_encode = ['TechSupport', 'Contract', 'PaymentMethod', 'PhoneServices',
                     'SecurityServices', 'StreamingServices',
                     'InternetServices']
# Create a OneHotEncoder object
ohe = OneHotEncoder()
# Fit and transform the columns using the OneHotEncoder
encoded_columns = ohe.fit_transform(X[columns_to_encode])
# Create a new DataFrame with the encoded columns
encoded_df = pd.DataFrame(encoded_columns.toarray(), columns=ohe.get_feature_names_out(columns_to_encode))
# Drop the original columns from the original DataFrame
X.drop(columns_to_encode, axis=1, inplace=True)
# Concatenate the original DataFrame with the new encoded DataFrame
X = pd.concat([X, encoded_df], axis=1)
The columns to be encoded are specified in the list columns_to_encode. This process effectively transforms the original categorical features into binary features, with a 1 indicating that a particular category is present and a 0 indicating that it is not.
Feature Scaling
Scaling is a common preprocessing step that helps to ensure all the features have the same range of values, which can help improve the performance of machine learning models.
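One caveat before scaling: in the commonly distributed version of this Telco dataset, TotalCharges is read in as a string column with a few blank entries, so scaling it directly would raise an error. A minimal pre-step, assuming that version of the data (this cleanup is not shown in the original):
# Assumed cleanup: coerce the blank TotalCharges strings to NaN, then fill with 0
X['TotalCharges'] = pd.to_numeric(X['TotalCharges'], errors='coerce')
X['TotalCharges'] = X['TotalCharges'].fillna(0)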
from sklearn.preprocessing import MinMaxScaler
# Create an instance of MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the MonthlyCharges and TotalCharges columns
X[['MonthlyCharges', 'TotalCharges', 'tenure']] = scaler.fit_transform(X[['MonthlyCharges', 'TotalCharges', 'tenure']])
This will scale the MonthlyCharges, TotalCharges, and tenure columns using the MinMaxScaler.
Machine Learning Modeling
Model 1
Random Forest Classifier over the Imbalanced Dataset
For the first model, we train a Random Forest Classifier on the imbalanced dataset. For tree-based algorithms such as Random Forest, Gradient Boosting, and Decision Trees, scaling is not required because they are not affected by the scale of the input features.
Create the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from imblearn.metrics import geometric_mean_score
# Instantiate the Random Forest model
rf = RandomForestClassifier(random_state=42)
After importing the necessary modules and instantiating the random forest classifier, you can proceed with fitting the model and evaluating its performance. Before that, however, the dataset needs to be split into training and testing sets, as sketched below.
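A minimal sketch of the split, assuming an 80/20 stratified split with a fixed seed (the original split parameters are not shown):
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for evaluation; stratify to preserve the churn ratio
# (test_size and random_state are assumptions, not taken from the original)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)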
Train the model
# Fit the model to the training data
rf.fit(X_train, y_train)
# Use the model to make predictions on the test data
y_pred_rf = rf.predict(X_test)
Evaluate the model on the test dataset
# Print the classification report
print(classification_report(y_test, y_pred_rf))
This is a classification report for a model’s performance on predicting two classes — 0 and 1.
In general, higher values of these metrics are better, as they indicate better overall performance of the model across all classes.
The precision for class 1 is 0.60, which means that out of all the instances the model predicted as class 1, 60% of them are actually positive.
The recall for class 1 is 0.48, which means that out of all the actual positive instances, the model correctly identified only 48% of them.
The F1-score for class 1 is 0.53, which is the harmonic mean of precision and recall for class 1.
Model 2
Decision Tree over the Imbalanced Dataset
For the second model, we train a Decision Tree.
This algorithm creates a tree-like model of decisions and their possible consequences. It is commonly used for both classification and regression problems.
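The training code for this model is not shown in the original; a minimal sketch, reusing the split and classification report from the Random Forest step:
from sklearn.tree import DecisionTreeClassifier
# Fit a decision tree on the imbalanced training data and evaluate it on the test set
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(classification_report(y_test, y_pred_dt))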
Classification report for the model after creating and training it:
In the second report, precision for class 0 is 0.81, and for class 1, it is 0.48.
The first classification report has better overall metrics compared to the second classification report.
In the first classification report, the accuracy is 0.78, while in the second classification report, it’s 0.73. This means that the first model is more accurate at predicting the correct class for new unseen data.
Similarly, the recall for class 1 in the first classification report is 0.48, while in the second classification report, it’s 0.46. This means that the first model has a better ability to correctly identify all the positive samples (i.e., churn customers) out of all the actual positive samples in the data.
Model 3
Adaptive Boosting (AdaBoost) over the Imbalanced Dataset
For the third model, we train an Adaptive Boosting (AdaBoost) algorithm.
This algorithm fits a sequence of weak learners (shallow decision trees by default), re-weighting the training samples so that each subsequent learner focuses on the examples the previous ones misclassified.
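Again, a minimal sketch of this step (the original code is not shown; default hyperparameters are assumed):
from sklearn.ensemble import AdaBoostClassifier
# Fit AdaBoost on the imbalanced training data and evaluate on the test set
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
print(classification_report(y_test, y_pred_ada))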
In the third classification report, the precision, recall, and F1-score for class 1 have improved significantly, indicating that the model is performing better in identifying the positive cases.
Model 4
Naive Bayes over Balanced Train Data
For the fourth model, we train Naive Bayes on training data balanced with the Synthetic Minority Oversampling Technique (SMOTE).
from imblearn.over_sampling import SMOTE
# Create the SMOTE object
smote = SMOTE()
# Apply SMOTE to the training data only
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
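With the resampled training data in place, the Naive Bayes step can be sketched as follows (the original code is not shown; the Gaussian variant is an assumption, since the article does not say which one was used):
from sklearn.naive_bayes import GaussianNB
# Fit Naive Bayes on the SMOTE-balanced training data; the test set stays untouched
nb = GaussianNB()
nb.fit(X_train_resampled, y_train_resampled)
y_pred_nb = nb.predict(X_test)
print(classification_report(y_test, y_pred_nb))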
Model 5
SVM over Balanced Train Data
For the fifth model, we train an SVM. For distance-based algorithms such as SVM and K-Nearest Neighbors, scaling may be necessary to ensure that all features are on a similar scale.
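A minimal sketch of this step (original code not shown; the default RBF kernel is assumed):
from sklearn.svm import SVC
# Fit an SVM on the balanced training data and evaluate on the test set
svm = SVC(random_state=42)
svm.fit(X_train_resampled, y_train_resampled)
y_pred_svm = svm.predict(X_test)
print(classification_report(y_test, y_pred_svm))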
Model 6
K-Nearest Neighbors over Balanced Train Data
For the sixth model, we train K-Nearest Neighbors on the balanced train data.
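Sketch (assuming the default of five neighbours):
from sklearn.neighbors import KNeighborsClassifier
# Fit KNN on the balanced training data; the earlier scaling keeps distances comparable
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_resampled, y_train_resampled)
y_pred_knn = knn.predict(X_test)
print(classification_report(y_test, y_pred_knn))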
Model 7
Gradient Boosting Classifier over Balanced Train Data
For the seventh model, we train a Gradient Boosting Classifier on the balanced train data.
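Sketch (default hyperparameters assumed):
from sklearn.ensemble import GradientBoostingClassifier
# Fit gradient boosting on the balanced training data and evaluate on the test set
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train_resampled, y_train_resampled)
y_pred_gb = gb.predict(X_test)
print(classification_report(y_test, y_pred_gb))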
Model 8
CatBoost over Balanced Train Data
For the eighth model, we train CatBoost on the balanced train data.
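Sketch (default hyperparameters assumed; verbose=0 just silences the per-iteration log):
from catboost import CatBoostClassifier
# Fit CatBoost on the balanced training data and evaluate on the test set
cb = CatBoostClassifier(random_state=42, verbose=0)
cb.fit(X_train_resampled, y_train_resampled)
y_pred_cb = cb.predict(X_test)
print(classification_report(y_test, y_pred_cb))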
Model 9
XGBoost over Balanced Train Data
For the ninth model, we train XGBoost on the balanced train data.
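Sketch (XGBoost expects numeric class labels, so mapping the Yes/No target to 1/0 is an assumed extra step):
from xgboost import XGBClassifier
# Map the string target to integers, since XGBoost requires numeric labels
y_train_num = y_train_resampled.map({'Yes': 1, 'No': 0})
y_test_num = y_test.map({'Yes': 1, 'No': 0})
xgb = XGBClassifier(random_state=42, eval_metric='logloss')
xgb.fit(X_train_resampled, y_train_num)
y_pred_xgb = xgb.predict(X_test)
print(classification_report(y_test_num, y_pred_xgb))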
Model 10
Adaptive Boosting (AdaBoost) over Balanced Train Data
For the tenth and final model, we train the AdaBoost algorithm again, this time on the balanced train data, to see whether resampling changes its performance.
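Sketch, reusing the AdaBoostClassifier import from Model 3:
# Refit AdaBoost, now on the SMOTE-balanced training data
ada_bal = AdaBoostClassifier(random_state=42)
ada_bal.fit(X_train_resampled, y_train_resampled)
y_pred_ada_bal = ada_bal.predict(X_test)
print(classification_report(y_test, y_pred_ada_bal))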
Comparing all the models
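A side-by-side table like the one summarized below can be assembled by looping over the fitted models (a sketch; the models dict and the pos_label choice are assumptions, and the XGBoost model is omitted because its labels were converted to integers):
# Collect test-set metrics for each fitted model into a single comparison table
models = {'RandomForest': rf, 'DecisionTree': dt, 'AdaBoost': ada, 'NaiveBayes': nb,
          'SVM': svm, 'KNN': knn, 'GradientBoosting': gb, 'CatBoost': cb, 'AdaBoost (balanced)': ada_bal}
results = []
for name, model in models.items():
    y_pred = np.ravel(model.predict(X_test))  # ravel in case a library returns a 2-D array
    results.append({'Model': name,
                    'Accuracy': accuracy_score(y_test, y_pred),
                    'Precision': precision_score(y_test, y_pred, pos_label='Yes'),
                    'Recall': recall_score(y_test, y_pred, pos_label='Yes'),
                    'F1': f1_score(y_test, y_pred, pos_label='Yes')})
print(pd.DataFrame(results).sort_values('F1', ascending=False))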
In conclusion, based on the evaluation metrics, the AdaBoost model trained on the imbalanced dataset performed the best, with an accuracy of 0.795, precision of 0.637, recall of 0.535, and F1-score of 0.581.
The Naive Bayes model performed well in terms of recall (0.81), but had lower precision and F1-score. The Random Forest model also performed well in terms of accuracy (0.779) and had higher precision and F1-score than the Naive Bayes model.
Overall, it seems that the ensemble models (AdaBoost, Random Forest, Gradient Boosting, CatBoost, and XGBoost) performed better than the other models on this dataset.