Evaluation Techniques for Machine Learning Models
Today, Machine Learning lets us gain insight from data through predictive modelling. At its core, it involves fitting mathematical models to the data at hand in order to generate business insights. Once a model has been fitted to prepared data, it can be used to make predictions on newly observed data.
In Machine Learning, models can only be as useful as the quality of their predictions; hence our fundamental goal is never simply to create models, but to create high-quality models with strong predictive power.
Alright, I’m done being a traditional lecturer. Let’s examine strategies for evaluating the quality of models that are generated by ML algorithms.
1. Binary Classifier Evaluation Metrics
Should I explain binary classifiers? No, not in this article, please.
When it comes to evaluating a binary classifier, I'll assume you've already built one; if not, Kaggle has great resources to get you started.
Accuracy is a popular performance evaluation metric. It can be used to distinguish a strong classification model from a weak one.
a. Accuracy is, simply put, the proportion of all observations that have been correctly predicted. There are four (4) main components that make up the mathematical formula for calculating accuracy:
· TP – True Positive (the total number of labels that belong to the positive class and have been predicted correctly)
· TN – True Negative (the total number of labels that belong to the negative class and have been predicted correctly)
· FP – False Positive (the total number of labels that have been predicted to belong to the positive class but actually belong to the negative class; also known as a Type I Error)
· FN – False Negative (the total number of labels that have been predicted to belong to the negative class but actually belong to the positive class; also known as a Type II Error)
These four components grant us the ability to explore other ML model evaluation metrics. The formula for calculating accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The main appeal of the accuracy metric is its ease of use and interpretation.
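To make this concrete, here is a minimal sketch of computing accuracy both by hand from the definition and with scikit-learn's accuracy_score; the labels below are made up purely for illustration:

```python
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# By hand: correctly predicted observations (TP + TN) over all observations.
correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))           # 0.75

# The same number from scikit-learn.
print(accuracy_score(y_true, y_pred))  # 0.75
```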
My disclaimer, or if I were a lawyer, CAVEAT!
Accuracy, as it is, is an evaluation metric that does not perform well when the classes are imbalanced. It suffers from what is known as the accuracy paradox: the accuracy value may be high while the model is seriously lacking in predictive power, and most, if not all, of its predictions for the minority class will be incorrect.
For this reason, we are highly compelled to turn to other evaluation metrics in the scikit-learn arsenal (obviously not England's Arsenal FC; no offence, none taken, lol) to understand this better.
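To see the paradox in action before we reach the case study, here is a small sketch on a hypothetical, heavily imbalanced dataset; the numbers are invented purely to show the effect:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical, heavily imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that blindly predicts the majority class every time.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- yet not a single positive case is caught
```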
Our Case Study
Most importantly, before we go on: you learn ML best with hands-on practice. Let's use the Heart Disease Dataset available on the UCI repository. You can download the clean dataset from here and my notebook from here.
Let's look at the confusion matrix values from the model:
A confusion matrix is an N x N matrix, where N is the number of labels being predicted. For this demonstration, let's have N = 2, and hence we get a 2 x 2 matrix.
From our train and test data, we already know that our test data consisted of 91 data points; that is the value at the intersection of the 3rd column and 3rd row, at the end of the matrix. We have also noticed that the matrix contains both actual and predicted values. The actual values are the number of data points that were originally categorized as 0 or 1. The predicted values are the number of data points our KNN model predicted as 0 or 1.
The actual values are: 42 data points in the positive class (heart disease present) and 49 in the negative class.
While the predicted values are: 49 data points predicted as positive and 42 predicted as negative.
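For reference, here is a hedged sketch of how such a matrix is typically produced with scikit-learn; the synthetic dataset below is only a stand-in for the heart disease data, not my notebook's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data; in our case study this would be the UCI
# Heart Disease features and labels.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
# For binary labels, scikit-learn lays the matrix out as:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```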
All the values we obtained above have a name: they are exactly the TP, TN, FP, and FN counts we defined earlier. Let's now put them to work, one metric at a time.
Back to base!
b. Precision – Precision is the proportion of all observations predicted to belong to the positive class that are actually positive. In clearer terms, it is the ratio between the True Positives and all predicted positives. For our problem statement, that would be the measure of patients that we correctly identify as having heart disease out of all the patients the model predicts as having it. Mathematically:
Precision = TP / (TP + FP)
What is the precision for our model? Yes, it is 0.816 (40 / (40 + 9)); in other words, when the model predicts that a patient has heart disease, it is correct around 82% of the time.
c. Recall – Recall is the measure of a model correctly identifying True Positives. Thus, for all the patients who actually have heart disease, recall tells us how many we correctly identified as having heart disease. For our model, Recall = 0.95 (40 / (40 + 2)).
Recall provides insight into how well our model is able to identify the relevant data. It is also referred to as Sensitivity or the True Positive Rate (TPR). What if a patient has heart disease, but no treatment is given because our model predicted they were negative? That is a disastrous situation!
Mathematically:
Recall = TP / (TP + FN)
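We can sanity-check both numbers with plain arithmetic on the confusion matrix counts we already have (scikit-learn's precision_score and recall_score would give the same results on the underlying label arrays):

```python
# Counts taken from our confusion matrix: TP = 40, FP = 9, FN = 2.
tp, fp, fn = 40, 9, 2

precision = tp / (tp + fp)  # 40 / 49
recall = tp / (tp + fn)     # 40 / 42

print(round(precision, 3), round(recall, 3))  # 0.816 0.952
```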
d. F1 Score – In ML, there is always a trade-off between precision and recall. For example, in our case study, we can argue that achieving a high recall is more important than achieving a high precision: we would like to detect as many heart patients as possible.
For some other models, like loan default classification (deciding whether a bank customer is likely to default), high precision is more desirable, since the bank wouldn't want to lose customers who were denied a loan based on the model's incorrect prediction that they would default.
The F1-Score is the harmonic mean of the precision and recall values for a classification problem. Because it is a harmonic mean, it is high only when both precision and recall are high, making it a single measure of how well the model balances the two. The formula for the F1 Score is:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
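Putting it together for our model, here is the F1 computation from the precision and recall values derived above:

```python
# Precision and recall from our model, as computed above.
precision = 40 / 49  # ~0.816
recall = 40 / 42     # ~0.952

# Harmonic mean of the two.
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.879
```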
2. Regression Analysis Evaluation Metrics
In regression analysis, one of the most widely used and well-known evaluation metrics is the MSE, which stands for Mean Squared Error.
Mean Squared Error is the average of the squared differences between the predicted and true values:
MSE = (1/n) * Σ (y_i - ŷ_i)^2
A good MSE should stay as close to 0 as possible.
In clearer terms: the higher the MSE value, the worse the quality of the model's predictions, because a high value indicates a large average squared error between predictions and true values.
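Here is a minimal sketch with made-up regression values, computing MSE by hand and confirming it with scikit-learn's mean_squared_error:

```python
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, purely for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# By hand: the mean of the squared differences.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print(mse)                                 # 0.875

# scikit-learn agrees.
print(mean_squared_error(y_true, y_pred))  # 0.875
```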