Performance Evaluation Metrics for Classification Models

One of the most important steps in any machine learning project is evaluating model performance. While building a model, how do you measure its success? When do you stop training and evaluating, and, most importantly, when do you call the model good enough?

In this article we will discuss the evaluation metrics available for classification models, and walk through specific use cases in later sections to understand when each metric applies.

What are we going to discuss in this article:

  • The confusion matrix and the terms associated with it.
  • Evaluation Metrics (Accuracy, Precision, Recall, F1 score)
  • Why is Accuracy not always the best Evaluation Metric?
  • An Analogy to understand Precision & Recall
  • When to use Precision and When to use Recall?
  • Role of F1 Score
  • Conclusion


Confusion Matrix

Let’s start with a confusion matrix.

A confusion matrix is a performance measurement for classification problems where the output can be two or more classes. It is a table showing the combinations of predicted and actual values.


  • True Positives (TP): cases where the actual class of the data point was 1 (True) and the predicted class is also 1 (True).
  • True Negatives (TN): cases where the actual class was 0 (False) and the predicted class is also 0 (False).
  • False Positives (FP): cases where the actual class was 0 (False) but the predicted class is 1 (True). "False" because the model predicted incorrectly, and "positive" because the predicted class was the positive one (1).
  • False Negatives (FN): cases where the actual class was 1 (True) but the predicted class is 0 (False). "False" because the model predicted incorrectly, and "negative" because the predicted class was the negative one (0).
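
To make these four terms concrete, here is a minimal Python sketch (with made-up labels, purely for illustration) that counts each cell of the confusion matrix by hand:

```python
# Illustrative labels; 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 3 1 1
```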


Evaluation Metrics

Accuracy

The most commonly used evaluation metric is accuracy. It measures how many observations, both positive and negative, were correctly classified.

Mathematically,

Accuracy = (TP + TN) / (TP + FN + FP + TN)
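
Continuing the toy example above, accuracy follows directly from the four counts (tp, tn, fp and fn are assumed to come from the earlier sketch):

```python
# Fraction of all observations that were classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75 for the toy labels above
```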

Precision

In the simplest terms, precision is the ratio of True Positives to all predicted positives. For example, in a heart-disease classifier, it is the proportion of patients we flagged as having heart disease who actually have it.

Mathematically,

Precision = TP / (TP + FP)
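
Using the same illustrative counts as before:

```python
# Precision: of everything we predicted positive, how much is really positive.
precision = tp / (tp + fp)
print(precision)  # 3 / (3 + 1) = 0.75 for the toy labels above
```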

Recall

Recall measures how well our model identifies True Positives: of all the patients who actually have heart disease, recall tells us how many we correctly identified. Mathematically,

Recall = TP / (TP + FN)
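
Again with the same illustrative counts:

```python
# Recall: of everything that is actually positive, how much did we find.
recall = tp / (tp + fn)
print(recall)  # 3 / (3 + 1) = 0.75 for the toy labels above
```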

F1 Score

The F1 score combines precision and recall into a single number: it is the harmonic mean of the two. For a given average of precision and recall, it is highest when the two are equal, and it drops sharply when either one is low.

Mathematically,

F1 = 2 x (Precision x Recall) / (Precision + Recall)
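
Continuing the toy example, F1 can be computed from the precision and recall values above:

```python
# F1 is the harmonic mean of precision and recall.
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # 0.75, since precision and recall are both 0.75 here
```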


Accuracy is not always the best evaluation metric

Accuracy is a poor choice for imbalanced problems: there it is easy to get a high accuracy score by simply classifying every observation as the majority class.

Imagine we have a dataset where only 1% of the samples are cancerous. A classifier that simply predicts all outcomes as benign would achieve an accuracy score of 99%. However, this model would, in fact, be useless and dangerous as it would never detect a cancerous observation.
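
A tiny sketch with made-up numbers makes the danger obvious: the always-benign "model" below scores 99% accuracy while detecting nothing.

```python
# Illustrative imbalanced dataset: 1% positive (cancerous), 99% negative (benign).
y_true = [1] * 10 + [0] * 990   # 1,000 samples, 10 true positives
y_pred = [0] * 1000             # a "model" that always predicts benign

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy)  # 0.99 -- looks impressive
print(recall)    # 0.0  -- never detects a single cancerous case
```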


An Analogy to understand Precision & Recall

Think of it like fishing with a net. You use a wide net and catch 80 of the 100 fish in a lake: that’s 80% recall. But you also haul in 80 rocks, so precision is only 50%; half of the net’s contents is junk.

You could use a smaller net and target one pocket of the lake where there are lots of fish and no rocks, but you might only get 20 of the fish in order to get 0 rocks. That is 20% recall and 100% precision.
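
Putting the analogy’s numbers into the formulas:

```python
# Values taken from the fishing analogy above.
wide_net_recall     = 80 / 100        # 80 of 100 fish caught      -> 0.80
wide_net_precision  = 80 / (80 + 80)  # 80 fish vs. 80 rocks       -> 0.50

small_net_recall    = 20 / 100        # only 20 of 100 fish caught -> 0.20
small_net_precision = 20 / (20 + 0)   # 20 fish, 0 rocks           -> 1.00
```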


When to use Precision and When to use Recall?

Precision is preferred when False Positives are a higher concern than False Negatives (for example, music or video recommendation systems and e-commerce websites, where a bad recommendation erodes user trust).


Recall, on the other hand, is the metric to watch when False Negatives are a higher concern than False Positives (for example, in medical screening, where a false alarm is tolerable but an actual positive case must not go undetected).

Let’s dive a little deeper into their use cases.

When False Positive is a higher concern - (Precision)

Let’s use a different example, where the model classifies whether an email is spam or not. Assign labels to the target variable: 1 means “the email is spam” and 0 means “the email is not spam”.

A non-spam email classified as spam comes under False Positive.

Suppose the model classifies an important email that you are desperately waiting for as spam (a False Positive). That is far worse than classifying a spam email as not spam (a False Negative), because in the latter case we can still delete it manually, and it is not much of a pain if it happens once in a while. So for spam email classification, minimizing False Positives matters more than minimizing False Negatives.

When False Negative is a higher concern - (Recall)

A person who has cancer but whose case the model classifies as no-cancer is a False Negative.

In a cancer detection problem, our main concern is to detect every cancerous patient. If the model classifies a case as cancerous when the person does not actually have cancer (a False Positive), that may be acceptable: it is less dangerous than missing a cancerous patient, since suspected cases are sent for further examination and reports anyway. But missing a cancer patient, i.e. classifying the case as no-cancer (a False Negative), is a serious mistake, because no further examination will be done.

The Role of the F1-Score

The discussion of accuracy made us realize that there is a tradeoff between precision and recall, and we first need to decide which one matters more for our classification problem.

There are also many situations where precision and recall are equally important. For example, if the doctor tells us that patients incorrectly flagged as suffering from heart disease matter too, because the flag could be indicative of some other ailment, then we would aim for high precision as well as high recall. In such cases we use the F1 score.

The F1 score is easier to work with: instead of balancing precision and recall separately, we can simply aim for a good F1 score, which indicates both a good precision and a good recall value.
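
In practice you rarely compute these metrics by hand; a library such as scikit-learn provides them all. A short sketch, assuming scikit-learn is installed and using made-up labels purely for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Illustrative labels only; replace with your model's actual predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```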

Note: An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds, using the True Positive Rate and the False Positive Rate; it will be discussed in a later article.

Conclusion

Understanding how well a machine learning model will perform on unseen data is the main purpose of these evaluation metrics. We saw how to evaluate a classification model, focusing especially on precision and recall and on finding a balance between them, and how to summarize model performance using different metrics and a confusion matrix.
