Debugging Deep Learning models
Grad-CAM applied to traffic sign classification

This is the second article describing my progress in the Udacity Self-Driving Car Engineer program. Each article will describe one project that is a mandatory part of the program.

In the second project the task was to write a traffic sign image classifier for the German Traffic Sign Dataset using a deep learning model and TensorFlow. The dataset provided was a slight modification of the original dataset, where all images were reduced to a size of 32x32x3. To pass, a validation accuracy of 93% was needed over 43 different classes. I will not go into all the details surrounding the actual classification task, since image classifiers are common knowledge in 2018 (just google deep learning for image classification), and you can find all the code at my github account via the link at the end of the article. What I find more interesting is how to debug and get insight into what a deep model has actually learned after it has been trained. Before I get into that, however, I will outline the model I used below.

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_10 (Conv2D)           (None, 30, 30, 32)        896       
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 14, 14, 128)       16512     
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 7, 7, 128)         0         
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 6, 6, 128)         65664     
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 3, 3, 128)         0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 1152)              0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 1152)              0         
_________________________________________________________________
dense_13 (Dense)             (None, 256)               295168    
_________________________________________________________________
dense_14 (Dense)             (None, 128)               32896     
_________________________________________________________________
dense_15 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_16 (Dense)             (None, 43)                2795      
=================================================================
Total params: 422,187
Trainable params: 422,187
Non-trainable params: 0
_________________________________________________________________

For the training I used the Adam optimizer over 50 epochs with a batch size of 1024 and a learning rate of 0.001. With this setup I reached a validation accuracy of 95% and a test accuracy of 91%.
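For reference, a minimal tf.keras sketch that reproduces the summary and training setup above could look like this. The kernel sizes follow from the output shapes and parameter counts in the summary; the activation functions, dropout rate, loss function and the X_train/y_train variable names are assumptions for illustration, not taken from my original code:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # 3x3 and 2x2 kernels match the output shapes and
    # parameter counts in the summary above
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (2, 2), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (2, 2), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),  # the exact dropout rate is an assumption
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(43, activation='softmax'),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',  # assumes integer labels
              metrics=['accuracy'])

# X_train/y_train and X_valid/y_valid are placeholders for the dataset arrays
model.fit(X_train, y_train, epochs=50, batch_size=1024,
          validation_data=(X_valid, y_valid))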

After having trained the model I downloaded some random traffic signs from google to evaluate how well my model performed on new images. Several images were classified correctly, but one image turned out to be problematic for my model. The image below shows a Work In Progress sign, but it was incorrectly classified as a Bumpy road sign.

The first technique one can use to get a better understanding is to visualize the feature maps learned by a convnet. That is, to feed a new input image to the convnet and, instead of reading the softmax output, look at the activations of all the intermediate convolutional layers. This is useful when you want to understand how successive convolutional layers transform their input and extract features at an increasingly abstract level from raw image data. My model uses 3 convolutional layers. Let's see how the feature maps of those layers actually get activated when the new image is fed through the model.
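Extracting those activations is straightforward in Keras: build a second model that shares the trained layers but outputs the intermediate feature maps instead of the prediction. A sketch, assuming image is a single (32, 32, 3) array preprocessed the same way as the training data:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Build a model that outputs the activations of every Conv2D layer
# instead of the softmax prediction.
conv_layers = [l for l in model.layers
               if isinstance(l, tf.keras.layers.Conv2D)]
activation_model = tf.keras.Model(inputs=model.inputs,
                                  outputs=[l.output for l in conv_layers])

activations = activation_model.predict(image[np.newaxis, ...])

# Plot each of the 32 feature maps of the first conv layer.
first_layer = activations[0]  # shape (1, 30, 30, 32)
for i in range(first_layer.shape[-1]):
    plt.subplot(4, 8, i + 1)
    plt.imshow(first_layer[0, :, :, i], cmap='gray')
    plt.axis('off')
plt.show()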

As can be seen in the images above, the first layer tends to look at the edges of the sign to find its shape. As far as I understand, this is a common theme for deep convnets: the initial layer works as an edge detector. In the initial layer much of the information in the original image is also still retained. In the following layers the extracted features get more abstract, and not all filters get activated for this image; the activations are said to get sparser. The black squares are filters that, according to the model, are not relevant for this type of image; they represent features present in other images. This is also a common theme in convnets: the deeper layers represent more abstract features. That is, they tend to carry less information about the original image and instead carry information about the actual classes being classified.

Another way to visualize what a network has learned was recently presented in a paper called Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. It describes a way of visualizing a heatmap of the class activation in an image. This is very cool, since it gives you a way to localize objects in the image, and by localizing what part of an image led to the predicted class it might (I say might since the output is not always trivial to interpret) help you identify why the model made a specific classification. It works by taking the output feature maps of the last convolutional layer (the third layer above) and weighing each feature map by the gradient of the class score with respect to that feature map (many more details can be found in the paper above).
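The core of Grad-CAM fits in a short function. Below is a sketch written against the TensorFlow 2 GradientTape API rather than the exact code I used; 'conv2d_12' is the name of the third convolutional layer in the summary above:

import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name='conv2d_12'):
    # Model mapping the input image to the last conv layer's
    # feature maps and to the final class scores.
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]

    # Gradient of the class score w.r.t. each feature map,
    # global-average-pooled to one weight per channel.
    grads = tape.gradient(class_score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Weighted sum of the feature maps, followed by a ReLU.
    cam = tf.nn.relu(tf.reduce_sum(conv_maps[0] * weights, axis=-1))

    # Normalize to [0, 1] so it can be rendered as a heatmap.
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()

Upsampling the returned 6x6 map to 32x32 and overlaying it on the input image gives heatmaps like the ones below.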

In the first picture below one can see the produced heatmap of which parts of the image the model believes make this image of class Work In Progress (which is the correct sign). But as I said, the model misclassifies this image as a Bumpy road sign. Looking at the heatmap, the red area, representing the region most indicative of this being a Work In Progress sign, is in the top left corner, which makes no sense.

Looking at the third image below, which is a heatmap showing what part of the image makes the model think this is a Bumpy road sign, the model seems very certain that the area in the middle of the sign is a bumpy road feature. This is why I say Grad-CAM might help you when debugging deep learning models: given the output, it is not always obvious what to do. But I will definitely experiment more with Grad-CAM since I find it a really cool technique, for localization if nothing else, and the actual paper gives much better examples of when it can be useful.


In the next article I will describe how behavioural cloning can be used to train a deep neural network to drive a car using only image data from an actual person driving the car. That will be awesome!

Link to github repo
