Logistic Regression and Decision Tree Classifier to predict churn for the company.

Our problem relies on a dataset containing features of various clients and the services they have contracted. From this, I built a model that classifies whether a client is likely to cancel the service, i.e., to become a churn for the company.

To achieve this, I used Spark's MLlib module, which gives us access to a variety of machine learning algorithms that can address our classification problem.

The models I explored were Logistic Regression and Decision Tree Classifier.

Although the data had already been processed for use by machine learning algorithms, one more transformation was needed to make it compatible with Spark's machine learning models.

Spark's ML estimators work on a single vector column rather than on many separate columns, so the DataFrame had to be converted into feature vectors. The tool I used for this was the VectorAssembler.

Now we have only two columns: one called 'features' and another called 'label' (I made this change because these are the column names Spark's estimators expect by default).

The 'label' column represents churn and contains values of 1 and 0, indicating whether the person has canceled the service. The 'features' column holds, for each row, a vector with the data from all the predictor columns.

It was necessary to remove some columns from dataset_prep: the label, which is the target, cannot be among the features since it is precisely what we want to predict; and the Id, which adds no value for classification. The first value shown is the number 24, repeated for every row, which represents the number of features used for prediction.

The objective now is to demonstrate the difference between the two models mentioned in the title of this article: Logistic Regression and Decision Tree Classifier. For this purpose, I used a PySpark tool called MulticlassClassificationEvaluator, an evaluator for classification problems with two or more classes.

This code evaluates the performance of the Logistic Regression and Decision Tree Classifier models on both the training and test data, printing the confusion matrices and several key metrics: accuracy, precision, recall, and F1 score. The example below is for the Logistic Regression model; with a few changes it can be adapted for the Decision Tree Classifier.

print('Logistic Regression - Train data')
print("="*40)
matrix_confusion(lr_train, normalize=False)
print("-"*40)
print("Metrics")
print("Accuracy: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))
print("="*40)
print("Logistic Regression - Test data")
print("="*40)
matrix_confusion(predicted_lr_test, normalize=False)
print("-"*40)
print("Metrics")
print("Accuracy: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

This is the code output for the Logistic Regression model:

This is the code output for the Decision Tree Classifier:

In summary, the confusion matrix shows the number of correct and incorrect predictions the model makes for each class. This is important for evaluating the performance of a classification algorithm and for checking whether it produces many false positives or false negatives. Depending on the context of the problem, these two kinds of error can have very different costs.
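To make the connection between the confusion matrix and the printed metrics concrete, here is a minimal pure-Python sketch. The counts are made up for illustration; they are not the article's actual results.

```python
# Illustrative confusion matrix counts (not the article's actual results).
tp, fp = 120, 30   # churners correctly flagged / non-churners wrongly flagged
fn, tn = 40, 310   # churners missed / non-churners correctly passed

accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall fraction correct
precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} "
      f"Recall={recall:.3f} F1={f1:.3f}")
```

With these counts, precision (0.800) is higher than recall (0.750): the model misses more real churners than it raises false alarms, which matters if losing a client costs more than a retention offer.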
