Inside a Neural Network’s Mind

Why do neural networks make the decisions they do? Often, the truth is that we don’t know; the network is a black box. Fortunately, there are now techniques that let us peek under the hood and understand how those decisions are made.

What has the neural network learned to consider attractive? Where does it look to decide whether an image is safe for work? Using grad-CAM, we explore the predictions of our models: sport type, action / non-action, drugs, violence, attractiveness, race, age, etc.

Github repo: https://github.com/hiveml/tensorflow-grad-cam

Areas of focus of a neural network via grad-cam on suggestive photos (Scarlett Johansson, Ryan Gosling)

Hey, my face is up here! Clearly, the attractiveness model focuses on body over face in the mid-range shots above. Interestingly, it has also learned to localize people without any bounding box information in training. The model is trained on 200k images, labeled by Hive into three classes: hot, neutral, and not. The scores for the three buckets are then combined into a rating from 0 to 10. This classifier is available here [1].

Areas of focus of attractiveness model via grad-cam when analyzing headshot photos and facial features.

The main idea is to apply the weights of the logit layer to the feature maps of the last convolutional layer, before global pooling. This creates a map showing how important each region of the image is to the network’s decision.

Areas of focus via grad-cam for sports action, NSFW (nudity), and violence classifier models
Sports action, NSFW, violence

The pose of the football player tells the model that a play is in action. We can clearly locate the nudity and the guns in the NSFW and Violence images, too.

Areas of focus via grad-cam for snowboarding, TV newscasts
Snowboarding, TV Show

A person in a suit, center frame, apparently indicates that it is a TV show instead of a commercial (right). The TV / commercial model is a great example of how grad-CAM can uncover unexpected reasons behind the decisions our models make. It can also confirm what we expect, as seen in the snowboarding example (left).

Areas of focus via grad-cam for animated TV shows (Rick & Morty, The Simpsons)
The Simpsons, Rick and Morty

This example uses our animated show classifier. Interestingly, the most important spot in the images above is the edge of Bart and Morty, including a substantial amount of the background in both cases.

Example architecture for CAM and GradCam analysis

CAM and grad-CAM

First developed by Zhou et al. [2], Class Activation Maps (CAM) show what the network is looking at. For each class, CAM highlights the parts of the image most important for that class.

Ramprasaath et al. [3] extended CAM into grad-CAM, which applies to a wider range of architectures without any changes to the network. Specifically, grad-CAM can handle fully connected layers and more complicated scenarios like question answering. However, almost all popular neural nets, such as ResNet, DenseNet, and even NASNet, end with global average pooling. For these, the heatmap can be computed directly using CAM, without the backward pass. This is especially important for speed-critical applications. Fortunately, with the ResNet used in this post, we don’t have to modify the network at all to compute CAM or grad-CAM.
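For a network that ends in global average pooling followed by a single dense layer, the CAM heatmap needs only a forward pass: it is the last convolutional feature maps weighted by that layer's weights for the chosen class. Here is a minimal NumPy sketch of that idea. The names `features` and `fc_weights` are hypothetical, and random arrays stand in for a real network.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """CAM for one image.

    features:   [H, W, C] output of the last conv layer (before pooling).
    fc_weights: [C, num_classes] weights of the dense layer after pooling.
    """
    # Weight every channel by its contribution to the chosen class logit.
    return features @ fc_weights[:, class_idx]          # [H, W]

rng = np.random.default_rng(0)
features = rng.random((10, 10, 2048))                   # stand-in activations
fc_weights = rng.standard_normal((2048, 5))             # stand-in dense weights

cam = class_activation_map(features, fc_weights, class_idx=3)

# Sanity check: the spatial mean of the map equals the class logit
# (ignoring the bias), which is why no backward pass is needed.
logit = features.mean(axis=(0, 1)) @ fc_weights[:, 3]
```

Because pooling and the dense layer are both linear, averaging the map recovers the logit, so the map is an exact spatial decomposition of the class score.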

Recently, grad-CAM++ (Chattopadhyay et al. [4]) further generalized the method to increase the precision of the output heat maps. Grad-CAM++ is better at dealing with multiple instances of a class and at highlighting the entire extent of the class rather than just its most salient parts. It achieves this using a weighted combination of the positive partial derivatives.
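The weighted combination can be sketched in NumPy. This follows the closed-form approximation used in several open-source implementations, which assumes the class score passes through an exponential so that higher-order derivatives reduce to powers of the first-order gradient. Treat it as an illustration under that assumption, not as the reference implementation. `activations` and `grads` are hypothetical [H, W, C] arrays.

```python
import numpy as np

def grad_cam_pp_weights(activations, grads, eps=1e-7):
    """Per-channel weights from [H, W, C] activations and gradients."""
    g2 = grads ** 2
    g3 = g2 * grads
    act_sum = activations.sum(axis=(0, 1), keepdims=True)    # [1, 1, C]
    denom = 2.0 * g2 + act_sum * g3
    denom = np.where(denom != 0.0, denom, eps)               # avoid 0 / 0
    alpha = g2 / denom                                       # pixel-wise coefficients
    # Only positive partial derivatives contribute to the weights.
    return (alpha * np.maximum(grads, 0.0)).sum(axis=(0, 1)) # [C]

def grad_cam_pp_map(activations, grads):
    weights = grad_cam_pp_weights(activations, grads)
    cam = np.maximum(activations @ weights, 0.0)             # ReLU, [H, W]
    peak = cam.max()
    return cam / peak if peak > 0 else cam

rng = np.random.default_rng(0)
activations = rng.random((4, 4, 8))          # stand-in post-ReLU activations
grads = rng.standard_normal((4, 4, 8))       # stand-in gradients

pp_weights = grad_cam_pp_weights(activations, grads)
pp_cam = grad_cam_pp_map(activations, grads)
```

Compared with plain grad-CAM, which averages the gradients uniformly over space, the per-pixel coefficients let several separate regions of the same class contribute to the weights.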

Here’s how grad-CAM is implemented in TensorFlow:

# One-hot mask that selects the predicted class
one_hot = tf.sparse_to_dense(predicted_class, [num_classes], 1.0)
# Zero out every logit except the predicted one
signal = tf.multiply(end_points['Logits'], one_hot)
loss = tf.reduce_mean(signal)

The signal is an array of num_classes elements in which only the logit of the predicted class is non-zero. Its mean defines the loss.
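To make the masking concrete, here is the same computation in plain NumPy on made-up numbers. The logits and the class index are hypothetical.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])     # hypothetical logits for 3 classes
predicted_class = 0

# Dense one-hot vector, as tf.sparse_to_dense builds in the snippet above.
one_hot = np.zeros_like(logits)
one_hot[predicted_class] = 1.0

signal = logits * one_hot               # only the predicted logit survives
loss = signal.mean()                    # predicted logit / num_classes
```

Taking the mean rather than the sum only scales the loss, and hence its gradient, by 1 / num_classes.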

grads = tf.gradients(loss, conv_layer)[0]
norm_grads = tf.divide(grads, tf.sqrt(tf.reduce_mean(tf.square(grads)))
	+ tf.constant(1e-5))

This computes the gradient of the loss with respect to the last convolutional layer, then normalizes it by its root mean square (plus a small constant to avoid division by zero).
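The normalization in the gradient snippet above divides the gradient tensor by the root mean square of all its entries. A small NumPy sketch shows the effect; the helper name `rms_normalize` and the random gradients are hypothetical.

```python
import numpy as np

def rms_normalize(grads, eps=1e-5):
    """Divide by the root mean square of all entries, plus a small epsilon."""
    return grads / (np.sqrt(np.mean(np.square(grads))) + eps)

rng = np.random.default_rng(0)
grads = rng.standard_normal((10, 10, 2048)) * 1e-2   # stand-in gradients

norm_grads = rms_normalize(grads)
rms_after = np.sqrt(np.mean(norm_grads ** 2))        # close to 1

# Rescaling the input barely changes the result, so a constant factor
# in the loss (such as 1 / num_classes from the mean) drops out.
rescaled = rms_normalize(3.0 * grads)
```

After this step the gradients have roughly unit scale regardless of how the loss was scaled, up to the small effect of the epsilon term.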

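output, grads_val = sess.run([conv_layer, norm_grads],
	feed_dict={imgs0: img})
# Drop the batch dimension (assumes a batch of one image)
output, grads_val = output[0], grads_val[0]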

Running the graph on the input image returns the activations of the last convolutional layer together with the normalized gradients.

weights = np.mean(grads_val, axis = (0, 1))             # [2048]
cam = np.ones(output.shape[0 : 2], dtype = np.float32)  # [10,10]

Averaging the gradients over the two spatial dimensions gives one weight per channel, which measures how important that feature map is to the chosen class. The cam array has the spatial shape of the convolutional output.
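The shape comments on the weights snippet above ([2048] and [10,10]) hold only once the batch dimension has been removed. A quick NumPy check, assuming the arrays come back from the session with a leading batch dimension of 1 (the array here is a zero-filled stand-in):

```python
import numpy as np

# Hypothetical session output for one image: [batch, H, W, C].
grads_batched = np.zeros((1, 10, 10, 2048), dtype=np.float32)

# Averaging axes (0, 1) on the batched array mixes batch and height.
wrong = np.mean(grads_batched, axis=(0, 1))     # shape (10, 2048)

# Drop the batch dimension first, then average height and width.
grads_val = grads_batched[0]                    # shape (10, 10, 2048)
weights = np.mean(grads_val, axis=(0, 1))       # shape (2048,)
```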

cam = np.ones(output.shape[0 : 2], dtype = np.float32)  # [10,10]
for i, w in enumerate(weights):
	cam += w * output[:, :, i]
cam = np.maximum(cam, 0)
cam = cam / np.max(cam)
cam = cv2.resize(cam, (eval_image_size, eval_image_size))

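Each feature map is added to the cam, scaled by its weight. We then pass the cam through a ReLU to keep only the positive contributions to that class and scale it so that its maximum is 1. Finally, we resize the coarse cam output to the input size and blend it with the input image for display.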
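The loop, ReLU, resize, and blend steps can be sketched end to end in NumPy alone. This is an illustration, not the post's code: nearest-neighbour repetition stands in for cv2.resize (which interpolates bilinearly by default), the red-channel overlay is one arbitrary way to blend, and the function names and random inputs are hypothetical.

```python
import numpy as np

def grad_cam_heatmap(output, weights, size):
    """output: [H, W, C] activations; weights: [C]; size: target side length."""
    # Same weighted sum as the loop above, including its initial value of ones.
    cam = np.ones(output.shape[:2], dtype=np.float32) + output @ weights
    cam = np.maximum(cam, 0)                    # ReLU
    peak = cam.max()
    if peak > 0:
        cam = cam / peak                        # scale so the maximum is 1
    # Nearest-neighbour upsampling; assumes size is a multiple of H and W.
    reps_h, reps_w = size // cam.shape[0], size // cam.shape[1]
    return np.repeat(np.repeat(cam, reps_h, axis=0), reps_w, axis=1)

def blend(image, heatmap, alpha=0.5):
    """image: [S, S, 3] floats in [0, 1]; heatmap: [S, S] floats in [0, 1]."""
    overlay = np.zeros_like(image)
    overlay[..., 0] = heatmap                   # show the heatmap in red
    return (1 - alpha) * image + alpha * overlay

rng = np.random.default_rng(0)
output = rng.random((10, 10, 64)).astype(np.float32)     # stand-in activations
weights = np.linspace(-1.0, 2.0, 64).astype(np.float32)  # stand-in weights
image = rng.random((300, 300, 3)).astype(np.float32)     # stand-in photo

heatmap = grad_cam_heatmap(output, weights, size=300)
blended = blend(image, heatmap)
```

The matrix product replaces the per-channel Python loop, which matters little for one image but helps when heatmaps are computed in bulk.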

Finally, the main function grabs the TensorFlow-Slim model definition and pre-processing function. With these, it computes the grad-CAM output and blends it with the input photo. In our code, we use the class with the greatest softmax probability as the input to grad_cam, but we could choose any class instead. For example:

The model predicted alcohol as the top choice with 99% and gambling with only 0.4%. By changing the predicted_class from alcohol to gambling, we can see that, despite the low class probability, the model can clearly pinpoint the gambling in the image.
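Switching the explained class only changes which logit the one-hot mask keeps. A minimal NumPy sketch follows; the label list and the logit values are made up for illustration and are not the model's actual outputs.

```python
import numpy as np

class_names = ["alcohol", "gambling", "other"]     # hypothetical label set
logits = np.array([6.0, 0.5, -1.0])                # hypothetical logits

# Numerically stable softmax over the logits.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

top_class = int(np.argmax(probs))                  # default: most probable class
target_class = class_names.index("gambling")       # override: explain another class

# The loss now keeps the gambling logit instead of the top one.
one_hot = np.zeros_like(logits)
one_hot[target_class] = 1.0
loss = np.mean(logits * one_hot)
```

Everything downstream (gradients, weights, heatmap) is unchanged; only the class index fed into the mask differs.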

References