Logistic Regression and Decision Tree Classifier to predict churn for the company.

Our problem relies on a dataset containing features of various clients and the services they have contracted. From this, I built a model that classifies whether a client is likely to cancel the service, i.e., to become a churn for the company.

To achieve this, I used Spark's MLlib module, which gives us access to a variety of machine learning algorithms that can address our classification problem.

The models I explored were Logistic Regression and Decision Tree Classifier.

Although the data had already been processed for use by machine learning algorithms, one more transformation was needed to make it compatible with Spark's machine learning models.

Spark's ML estimators work on a single vector column rather than on many separate columns, so the DataFrame had to be converted into feature vectors. The tool I used for this was the VectorAssembler.

Now we have only two columns: one called 'features' and another called 'label' (I made this change because these are the column names Spark's estimators expect by default).

The 'label' column represents churn and contains values of 1 and 0, indicating whether the person has canceled the service. The 'features' column holds, for each row, a vector with the data from all the predictor columns.

It was necessary to remove some columns from dataset_prep: the label, which is the target, cannot be among the features since it is precisely what we want to predict; and the Id, which adds no value for classification. The first value shown is the number 24, repeated for every row, which represents the number of features used for prediction.

The objective now is to demonstrate the difference between the two models mentioned in the title of this article: Logistic Regression and Decision Tree Classifier. For this purpose, I used a PySpark tool called MulticlassClassificationEvaluator, an evaluator for classification problems with two or more classes.

This code evaluates the performance of the Logistic Regression and Decision Tree Classifier models on both the training and test data, printing the confusion matrices and several key metrics: accuracy, precision, recall, and F1 score. The example below is for the Logistic Regression model; with a few changes it can be adapted for the Decision Tree Classifier.

print('Logistic Regression - Train data')
print("="*40)
matrix_confusion(lr_train, normalize=False)
print("-"*40)
print("Metrics")
print("Accuracy: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(lr_train, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))
print("="*40)
print("Logistic Regression - Test data")
print("="*40)
matrix_confusion(predicted_lr_test, normalize=False)
print("-"*40)
print("Metrics")
print("Accuracy: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(predicted_lr_test, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

This is the code output for the Logistic Regression model:

This is the code output for the Decision Tree Classifier:

In summary, the confusion matrix shows the number of correct and incorrect predictions the model makes for each class. This is important for evaluating the performance of a classification algorithm and for checking whether it produces many false positives or false negatives. Depending on the context of the problem, these two kinds of error can have very different costs.
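To make the connection between the confusion matrix and the printed metrics concrete, here is a minimal pure-Python sketch. The counts are made up for illustration; they are not the article's actual results.

```python
# Illustrative confusion matrix counts (not the article's actual results).
tp, fp = 120, 30   # churners correctly flagged / non-churners wrongly flagged
fn, tn = 40, 310   # churners missed / non-churners correctly passed

accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall fraction correct
precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} "
      f"Recall={recall:.3f} F1={f1:.3f}")
```

With these counts, precision (0.800) is higher than recall (0.750): the model misses more real churners than it raises false alarms, which matters if losing a client costs more than a retention offer.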
