How to Predict Customer Churn using Autoencoders in Python
Ever heard of an autoencoder, wondered what it was, or how to use it?
My colleague Niharika Bangur and I worked on this article together. If you don't have time to read the whole thing, here are a few highlights about autoencoders:
If you want to know more, keep reading! You can even follow along with our detailed codebook and csv file provided here.
What is an Autoencoder?
An autoencoder is an unsupervised learning technique that uses a neural network for representation learning. A neural network is a computational model inspired by the structure and functioning of the human brain. It consists of interconnected nodes, called neurons, organized in layers. Each neuron receives input, applies a mathematical function to it, and produces an output. Through a process called training, neural networks can learn from data and adjust the strengths of the connections between neurons to make predictions or solve complex tasks, such as image recognition or natural language processing.
There are two important functions that an autoencoder learns: an encoder, which compresses the input data into a smaller representation, and a decoder, which reconstructs the original data from that compressed representation.
During training, the autoencoder tries to minimize the difference between the input data and the reconstructed data. To do this, the network has to learn a compressed representation that captures the most important information in the input. Autoencoders can be used for tasks such as data compression, image denoising, anomaly detection, and feature extraction.
The main idea of an autoencoder is to recreate the input as best as possible. This can be measured by the reconstruction loss.
An important part we have not talked about yet is the hidden layer called the latent space, also known as the bottleneck, which appears in image 1 as the latent representation. It is called a bottleneck because it is the narrow part of the network that all the data must pass through: it holds the compressed version of the original input. The compressed data has fewer dimensions than the original data; the most important features are kept and the rest are discarded, and the bottleneck layer is where these features are preserved. The decoder then uses this compressed representation to reconstruct the original data.
The main idea behind the bottleneck is to force the autoencoder to learn a compact and efficient representation of the input data.
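Since image 1 is not reproduced here, the encoder-bottleneck-decoder idea can be sketched with a tiny linear autoencoder in plain NumPy. The 8-to-3 compression, the random toy data, and the hand-written gradient-descent loop are illustrative assumptions, not the model used later in this article:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # toy data: 200 samples, 8 features

# Linear autoencoder: encoder compresses 8 -> 3, decoder reconstructs 3 -> 8
W_enc = rng.normal(scale=0.1, size=(8, 3))
W_dec = rng.normal(scale=0.1, size=(3, 8))

def reconstruction_loss(X, W_enc, W_dec):
    Z = X @ W_enc              # latent (bottleneck) representation
    X_hat = Z @ W_dec          # reconstruction
    return np.mean((X - X_hat) ** 2)

# A few steps of plain gradient descent on the reconstruction loss
lr = 0.01
losses = []
for _ in range(200):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    err = X_hat - X
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)
    losses.append(reconstruction_loss(X, W_enc, W_dec))

print(losses[0], losses[-1])  # the loss shrinks as the bottleneck learns
```

Because the 3-dimensional latent space cannot hold all 8 dimensions, the network is forced to keep only the directions of the data that matter most for reconstruction.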
Three Important Types of Autoencoders
How to Predict Customer Churn using Autoencoders
In today's highly competitive market, customer churn has become a significant concern for businesses. Churn refers to the loss of customers or clients who have stopped doing business with a company.
Detecting customer churn in advance is crucial for businesses to take proactive measures to retain their customers. This example uses a deep autoencoder model, trained on a large dataset of customers, to extract the key features that identify churned customers.
In this example we will use data about customers at a fictitious bank and use Python as our programming language.
Our analysis and predictions were executed in Google Colaboratory ("Colab"), a product by Google where users can write and run Python code in a browser-based environment without any local installation; however, this code can be run in other environments that support Python.
There are 7 steps to conduct this analysis:
Reading the Data
In Colab, there are a few ways you can upload your data files. We clicked on the folder icon on the left-hand side, then dragged and dropped the file (uploading from your computer can also be done).
The first step is to read in the data and make sure that we understand the data and see what it consists of. The data used was called Customer_Churn.csv.
On the first line we import pandas, a Python library for data manipulation. Next, we read in the dataset, then take a closer look by showing the top 5 rows of the DataFrame, called df.
The “Churned” column is very important to us. In this column 1 = customer churned and 0 = customer did not churn. Now that we have seen what our data looks like we move onto the next step.
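For readers following along without the code images, the reading step might look like the sketch below. The rows and most of the column names here are assumed stand-ins, since Customer_Churn.csv itself is not reproduced in this article:

```python
import pandas as pd
from io import StringIO

# Stand-in for Customer_Churn.csv (illustrative rows, not the real file)
csv_data = StringIO("""CustomerId,Lastnames,Geography,Gender,Age,Balance,Churned
101,Smith,France,Female,42,0.0,1
102,Jones,Spain,Male,35,83807.86,0
103,Lee,Germany,Female,58,159660.8,1
""")

df = pd.read_csv(csv_data)   # in Colab: pd.read_csv("Customer_Churn.csv")
print(df.head())             # inspect the top rows, as in the article
```

With the real file uploaded to Colab, only the path inside read_csv() changes.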
Dropping Unnecessary Variables
In this data there are some variables that are unnecessary for predicting customer churn. For us, "CustomerId" is not of much value, and neither is the "Lastnames" column: they identify individuals but do not provide any useful information for detecting churn, and hence they can be dropped.
We used the .drop() function and selected two columns. We used axis = 1 to indicate that we are removing columns and inplace=True to ensure that our dataframe (df) is modified in place, meaning the changes are applied directly to df.
As you can see in image 6 the two columns are not present in the dataset anymore.
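A minimal sketch of this dropping step, using a small stand-in DataFrame in place of the real data:

```python
import pandas as pd

# Stand-in for the churn data (illustrative rows only)
df = pd.DataFrame({
    "CustomerId": [101, 102],
    "Lastnames": ["Smith", "Jones"],
    "Age": [42, 35],
    "Churned": [1, 0],
})

# axis=1 -> drop columns (not rows); inplace=True -> modify df directly
df.drop(["CustomerId", "Lastnames"], axis=1, inplace=True)
print(df.columns.tolist())  # ['Age', 'Churned']
```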
Dummy-encode Categorical Variables
Dummy-encoding is a technique for representing categorical variables as numbers. In image 7, the .get_dummies() function does not need any column names, since it makes dummies for all the categorical variables: for each row, a 1 or 0 indicates whether a specific category is present. We used the drop_first=True argument, which drops the first category of each variable, leaving n-1 dummy variables for a categorical variable with n distinct categories. By dropping the first level, we implicitly encode the presence or absence of each category relative to the dropped one. A new column is created for every remaining category, and the original columns are dropped.
In this dataset there are two categorical variables, “Geography” and “Gender”. We dummy-encoded both columns. “Geography” initially contained three countries: France, Germany, and Spain. After dummy encoding and dropping the first variable within the Geography column, we are left with a “Geography_Spain” and “Geography_Germany” column. If both values are 0, like in line 5 of Image 7, then that row corresponds to France and not Spain or Germany.
“Gender” initially contained Male or Female. After dummy encoding and dropping the first variable within the Gender column, we are left with Gender_Male;1 = male, and 0 = female.
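The encoding step can be sketched on a small stand-in DataFrame with the same two categorical columns:

```python
import pandas as pd

# Stand-in rows with the article's two categorical variables
df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 35, 58, 29],
})

# drop_first=True keeps n-1 dummies: France and Female become the baselines
df = pd.get_dummies(df, drop_first=True)
print(df.columns.tolist())
```

A row with Geography_Germany = 0 and Geography_Spain = 0 is therefore a France row, exactly as described above.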
Scaling the Data
Scaling the data means transforming the numerical variables to a specific scale or range. This helps make sure there is no bias, that all the features contribute equally to the analysis, and that no single feature dominates because of outliers. The MinMaxScaler() function is a popular method that scales the data into the range (0, 1): each value has the feature's minimum subtracted from it and is then divided by the difference between the feature's maximum and minimum.
In image 8 the fit_transform() method is used on the feature set X to transform the values of each feature to a range between 0 and 1 using the MinMaxScaler() function. The feature set X is created by dropping the "Churned" variable, which is instead assigned to y as the target variable.
Fitting an AutoEncoder Architecture
First we need to import some important libraries to be able to use certain functions.
Image 10 is the most important. Here, we develop our deep autoencoder. Something to keep in mind is that the "relu" activation and a small amount of dropout (a rate of 0.1) help prevent overfitting in an autoencoder.
Table 1 provides a detailed description of what each line of the autoencoder architecture does.
We then compile the autoencoder by using the Adam optimizer and mean squared error (mse) as the loss. Compiling refers to configuring the autoencoder for training. When you compile an autoencoder, you specify how it should learn from the provided data.
The reason for choosing mse is that it measures the average squared difference between the predicted values and the true values, which tells us how well the autoencoder is able to reconstruct the input data.
Next, we fit the model. We pass X, X because the input data is used both as the input and as the target output. epochs is the number of times to iterate over the dataset, batch_size specifies the number of samples processed in each batch (which controls memory usage), and validation_split specifies the fraction of the training data that will be set aside for validation.
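Since the code images are not reproduced here, a minimal Keras sketch of such a deep autoencoder might look like the following. The layer sizes, the dropout rate, the 11-feature input, and the random stand-in data are illustrative assumptions, not the article's exact architecture:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(42)
X = rng.random((256, 11)).astype("float32")  # stand-in for the scaled features

# A small deep autoencoder: encoder -> bottleneck -> decoder
autoencoder = keras.Sequential([
    keras.Input(shape=(11,)),
    layers.Dense(8, activation="relu"),
    layers.Dropout(0.1),                    # light regularization
    layers.Dense(4, activation="relu"),     # bottleneck / latent representation
    layers.Dense(8, activation="relu"),
    layers.Dense(11, activation="sigmoid"), # outputs in [0, 1], like the inputs
])

# Adam optimizer with mean squared error as the reconstruction loss
autoencoder.compile(optimizer="adam", loss="mse")

# X is both input and target: the network learns to reconstruct its input
history = autoencoder.fit(X, X, epochs=5, batch_size=32,
                          validation_split=0.2, verbose=0)
```

The sigmoid output layer pairs naturally with MinMax-scaled inputs, since both live in the 0-to-1 range.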
Determining the 500 most likely to churn customers
After fitting the autoencoder, we can use it to predict the customers who are most likely to churn. We can do this by selecting the customers whose reconstruction error is high. The reconstruction error is the difference between the original input data and the reconstructed data obtained from the autoencoder. We can sort the customers based on their reconstruction error and select the 500 observations with the highest error as the most likely to churn.
Now we can evaluate the reconstruction error for the input data X. In the code, **2 squares the difference between the input and the prediction for each element, and np.mean() averages those squared differences for each sample, giving one error value per customer.
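A sketch of the per-customer error calculation on toy numbers (X_pred stands in for the output of autoencoder.predict(X)):

```python
import numpy as np

# Toy input and a stand-in "reconstruction" from the autoencoder
X = np.array([[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]])
X_pred = np.array([[0.1, 0.8], [0.5, 0.5], [0.2, 0.7]])

# Per-sample MSE: square the element-wise differences, average over features
mse = np.mean((X - X_pred) ** 2, axis=1)
print(mse)  # the third row reconstructs worst, so it has the highest error
```

Customers the autoencoder reconstructs poorly are the unusual ones, and in this application, the ones most likely to churn.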
In image 15 the histogram plot of the MSE shows the reconstruction errors that we computed above.
The reconstruction error values are then added as a new column into a new dataset called df_error. It also contains the columns from the prior data. The values are then sorted using the sort_values() function. See Image 16.
We then extracted 500 observations with the highest reconstruction error values which would be those most likely to churn.
With the result above we can see that 185 of the 500 observations with the largest errors correspond to customers who churned.
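The sorting-and-selecting step can be sketched on toy data; here we keep the top 3 of 5 stand-in rows instead of the article's top 500 (the column name "Error" is an assumption, since the images are not reproduced):

```python
import numpy as np
import pandas as pd

# Stand-in churn labels and per-customer reconstruction errors
df = pd.DataFrame({"Churned": [1, 0, 0, 1, 0]})
mse = np.array([0.40, 0.05, 0.30, 0.35, 0.10])

df_error = df.copy()
df_error["Error"] = mse   # attach the errors to the prior data

# Sort by error and keep the highest-error customers
top = df_error.sort_values("Error", ascending=False).head(3)
print(top["Churned"].sum())  # churned customers among the highest-error rows
```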
Verifying the Predictions
First, we need to see how many customers in the whole dataset churned. This code counts the number of 1's in the "Churned" column, since we know that 1 = churned.
In this last code we take the number of churned customers from the 500 observations with the largest errors and divide it by the number of customers that actually churned in the whole dataset which is 1822.
With this we can see that roughly 10% of all the customers who churned were captured within our top 500 highest-error observations.
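That final percentage is simple arithmetic on the two counts from the article:

```python
# Numbers from the article: 185 of the top-500 highest-error customers
# churned, out of 1822 churned customers in the whole dataset
total_churned = 1822
churned_in_top_500 = 185

fraction = churned_in_top_500 / total_churned
print(round(fraction * 100, 1))  # 10.2 (percent)
```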
Final Thoughts
With the help of an autoencoder model we were able to detect customers who churned. Autoencoders can be very helpful and can be used in many types of situations.
Citations for Images
Image 2: Example of a deep autoencoder using a neural network
Roche, Fanny & Hueber, Thomas & Limier, Samuel & Girin, Laurent. (2019). Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models.
Image 3: Denoising Autoencoder
Nishad, Garima. “Reconstruct Corrupted Data Using Denoising Autoencoder (Python Code).” Medium, Analytics Vidhya, 25 Feb. 2021