Explaining the Difference Between Recall and Precision with Dogs
When you think about machine learning, the image that might come to mind is computers learning to do everything from flagging fraudulent credit card activity to predicting taxi demand, all without the aid of a human being.
Machine learning is about training a system without explicit guidelines. The model learns the structure or patterns within a dataset without being given step-by-step instructions for how to do something. Think of it as a friend who can put together an IKEA cabinet just by looking at lots of other cabinets, without ever reading the directions.
With that said, I thought it would be worthwhile to look at image recognition, a task we do every day but one that machine learning has only recently mastered. Specifically, let's talk about identifying dogs and how to evaluate whether a model is successful at finding which images really do feature our four-legged friends.
The problem of identifying pictures of dogs is a classification task in machine learning. A technical definition of classification modeling for this problem would be: find a function that looks at the features in an image and predicts whether a dog is present. The model would examine features such as the shape of the ears, the number of eyes, whether the animal has a snout (a 1/0 value) and whether it has a tail (again, a 1/0 value). Once the model had extracted all of the feature values from a picture, it would compare them to the values from photos labeled as “dogs” or “not dogs” and make its prediction as a discrete, usually categorical, value. In this example, the output would simply be a 1 for an image of a dog and a 0 for a non-dog image.
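To make that idea concrete, here is a minimal sketch of a classifier as a function from features to a discrete 1/0 output. The feature names and the threshold rule are hypothetical stand-ins: a real model would learn its decision rule from the labeled photos rather than have it hard-coded.

```python
# A toy sketch of classification as a function of image features.
# Feature names and the "2 of 3" rule are hypothetical placeholders;
# a trained model would learn its rule from labeled examples.

def classify(features: dict) -> int:
    """Return 1 for 'dog', 0 for 'not dog' from binary (0/1) features."""
    score = features["has_snout"] + features["has_tail"] + features["floppy_ears"]
    return 1 if score >= 2 else 0

photo = {"has_snout": 1, "has_tail": 1, "floppy_ears": 0}
print(classify(photo))  # 1 -> the model predicts "dog"
```

The point is only the shape of the thing: feature values go in, a discrete 1 ("dog") or 0 ("not dog") comes out.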
So far, all of this would sound relatively straightforward if the machine learning model were a human being. But since we have not quite reached the age of the singularity, there are a couple of things to consider when judging how well two models classify photos of dogs and non-dogs. In fact, a small-scale example illustrates a classic saying in statistical learning from George E. P. Box: "Essentially, all models are wrong, but some are useful."
In our hypothetical situation, let's say we have 15 photos: 10 of them contain dogs and 5 do not.
Typically, you would assume the best-performing model is the one with the highest accuracy on these photos. There's a problem with that line of thinking, though, as a comparison of Models A and B shows:
Model A: Classifies all 15 photos as dogs, so it correctly classifies all 10 dog photos (and mislabels all 5 non-dog photos)
Model B: Classifies 10 photos as dogs; 9 of those really are dogs, so it finds 9 of the 10 dog photos (and mislabels 1 non-dog photo, while missing 1 dog)
At first glance, Model A looks respectable. Its accuracy, or the number of correct predictions divided by the total number of predictions, is 10/15 ≈ 0.67, even though it learned nothing and simply labels every photo a dog. Model B's accuracy is actually higher, (9 + 4)/15 ≈ 0.87, because its 4 correct "no dog" predictions also count as correct. The deeper problem is that accuracy defined this way doesn't tell us the entire story and can hide serious flaws, especially if there are imbalances in the classes of future data sets we test: if 90 percent of future photos contained dogs, Model A's label-everything-a-dog strategy would score 0.90 while telling us nothing at all. Model B, which actually discriminates between dogs and non-dogs, is the model we should prefer.
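The accuracy arithmetic is easy to check in a few lines of Python. The label vectors below are one hypothetical arrangement of the 15 photos that matches the counts in the example (10 dog photos, then 5 non-dog photos); note that Model B's correct "no dog" calls count toward its accuracy too.

```python
# 1 = dog, 0 = not dog; 10 dog photos followed by 5 non-dog photos
actual = [1] * 10 + [0] * 5

# Model A labels every photo a dog
pred_a = [1] * 15

# Model B: finds 9 dogs, misses 1 dog, mislabels 1 non-dog as a dog
pred_b = [1] * 9 + [0] + [1] + [0] * 4

def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

print(accuracy(actual, pred_a))  # 0.666... (10 of 15 correct)
print(accuracy(actual, pred_b))  # 0.866... (13 of 15 correct)
```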
Why is that? Well, to measure more meaningfully whether our models are working, we need to look at the precision and recall of our classifiers.
Learning about the confusion matrix
Before jumping into how precision and recall affect which model we should use, it is important to understand true positives, false positives, false negatives and true negatives, the four cells of the confusion matrix. These terms apply to any binary classification process, and in the context of our problem it is easy to give an example of each:
True positive: The machine learning model guesses a photo has a dog and it does.
False positive (Type I Error): The machine learning model guesses a photo has a dog and it does not.
False negative (Type II Error): The machine learning model guesses a photo does not have a dog and it does.
True negative: The machine learning model guesses a photo does not have a dog and it does not.
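These four outcomes can be tallied mechanically. Here is a small sketch that counts them for both models, using the same hypothetical label arrangement as the worked example (1 = dog, 0 = not dog):

```python
def confusion_counts(actual, predicted):
    """Tally (TP, FP, FN, TN) for a binary classifier where 1 = dog."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual = [1] * 10 + [0] * 5                    # 10 dogs, 5 non-dogs
pred_a = [1] * 15                              # Model A: everything is a dog
pred_b = [1] * 9 + [0] + [1] + [0] * 4         # Model B from the example

print(confusion_counts(actual, pred_a))  # (10, 5, 0, 0)
print(confusion_counts(actual, pred_b))  # (9, 1, 1, 4)
```

These counts are exactly the inputs the precision and recall formulas below will use.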
By the way, I'll take a closer look at Type I and Type II errors in a later post, but for now, here's my most straightforward explanation:
Precision is all about your positive predictions being correct
So when measuring the precision of our two models, we want to figure out how many of the photos classified as dogs actually contain dogs. Mathematically, precision is the number of true positives divided by the sum of true positives and false positives.
Model A: 10 True Positives/(10 True Positives + 5 False Positives) = 0.67
Model B: 9 True Positives/(9 True Positives + 1 False Positive) = 0.90
Model B is significantly more precise, as it does not simply declare every photo to contain a dog.
Recall is all about the proportion of actual positives being predicted correctly
When defining recall in this case, you need to see how well each model finds the images that truly contain dogs. Mathematically, recall is the number of true positives divided by the sum of true positives and false negatives.
Model A: 10 True Positives/(10 True Positives + 0 False Negatives) = 1
Model B: 9 True Positives/(9 True Positives + 1 False Negative) = .9
Though Model A has better recall, Model B is close behind. It is important to remember that precision and recall are usually in a tug of war: improving one tends to decrease the other, so balancing the two affects which classifier you choose. The F1 score, or harmonic mean of precision and recall, was developed to summarize both in a single number and to handle exactly the kind of example outlined here. The formula is 2 multiplied by precision times recall, divided by the sum of precision and recall:
Model A: 2 * (1 * 0.67)/(1 + 0.67) ≈ 0.80
Model B: 2 * (0.9 * 0.9)/(0.9 + 0.9) = 0.90
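All three metrics can be verified together from the true positive, false positive and false negative counts in the worked example; this sketch simply re-derives the numbers:

```python
def precision(tp, fp):
    """Of the photos called 'dog', what fraction really are dogs?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the actual dog photos, what fraction did the model find?"""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# (tp, fp, fn) counts from the example
for name, tp, fp, fn in [("Model A", 10, 5, 0), ("Model B", 9, 1, 1)]:
    p, r = precision(tp, fp), recall(tp, fn)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1(p, r):.2f}")
# Model A: precision=0.67 recall=1.00 F1=0.80
# Model B: precision=0.90 recall=0.90 F1=0.90
```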
Thus, as I stated before, Model B is the winner, with a higher F1 score. And if you read this far, you now have a much better sense of why it can be better to be wrong sometimes than right all the time on a small data set.
If you are looking for a more in-depth overview of classification in machine learning, check out the Google Developers Crash Course on Machine Learning and as always, any feedback is appreciated. Feel free to reach out here or via my email at alex@jetwolfelabs.com.