My last post described Convolutional Neural Networks (ConvNets) and some obstacles to understanding them. ConvNets identify objects in images by computing layers of features and using the last layer for object identification rather than directly using the input. Layers closer to the input capture low level visual concepts (e.g., edges and colors) and deeper layers capture higher level concepts (e.g., faces). We visualized features with “gradient” images (like the one above) which highlighted image pixels that had a significant effect on the feature in question. The conclusion was that ConvNets compute features that detect eyes and noses, but there wasn’t much intuition about how those eyes and noses are computed.
Here I’ll introduce a visualization technique that tries to do that. Instead of visualizing individual neurons, we’ll look at how neurons are related. A nose detector is made from a combination of features in the previous layer. To understand such a composition of features we’ll relate visualizations from one layer to visualizations of adjacent layers.
All of the code I used to produce these visualizations is available on github!
So far we’ve understood feature visualizations by explaining that they select for either (1) concepts humans can label at a high level (e.g., face) or (2) concepts that humans relate to pixels (e.g., edge; low level). The first occurs in deeper layers which are closer to the output and the second occurs in shallow layers which are closer to the input. We can see that middle layers of neurons might be useful because they capture structure: certain configurations of nearby pixels that occur more often than others. But we don’t have the kind of specific intuition for these middle features that we have for the very shallow and very deep ones. This means we can’t fit the outputs (high level) and inputs (low level) of a ConvNet together intuitively. Even though we can relate some features in specific ways (e.g., a head should have an ear), these examples stay within the realm of low level or high level and do not cross boundaries. This problem is ill defined and purely one of intuition, so it’s not clear what a solution looks like, but it motivates the search for new visualization methods.
ConvNets learn hierarchies of features. Hierarchies are graphs with nodes and edges, so to understand hierarchies we need to understand edges as well as nodes. It makes sense to visualize relationships between features.
One interesting visualization which already does this is DrawNet, from Antonio Torralba at MIT. This is a really neat visualization, which you should go check out (though perhaps later ;). It uses the empirical receptive fields from this paper about emergent object detectors along with a visualization that connects individual neurons that depend heavily on each other¹. However, it’s still not image specific, so it can only say how a ConvNet would “like” to relate some parts to others instead of how a ConvNet does relate some parts to others.
Our strategy will be to show single-feature visualizations from different layers and connect them when one feature heavily influences another. You can skip to the examples if you want to avoid a small bit of math.
It would be nice to connect neurons in adjacent layers so they are only connected when

(1) the feature in the lower layer is strongly activated by the example at hand, and
(2) the feature in the lower layer strongly influences the feature in the layer above.

If the first doesn’t hold then we’ll end up visualizing a pattern which isn’t relevant to the example at hand. If the second doesn’t hold then we’ll relate two features which the ConvNet does not think are related.
There’s a fairly natural way to do this. Consider some fully connected layer fc7 which produces activations using

$$h_7 = W_7 h_6 + b_7$$

where $h_6$ is the output of the previous layer (fc6) and $W_7$ is the weight matrix for layer fc7. The following vector, the $i$-th row of $W_7$, weights fc6 features highly if they are related to the $i$-th fc7 feature, satisfying (2):

$$W_7^{(i)}$$

The activation vector $h_6$ directly represents patterns detected in fc6, satisfying (1). That means an element-wise product of the two vectors scores fc6 features highly when they satisfy both (1) and (2), and lower otherwise:

$$W_7^{(i)} \odot h_6$$
To relate a feature in layer 7 to layer 6 we can use the top 5 indices of this vector to select the fc6 features to visualize. Because the visualizations of fully-connected layers turn out to not be very informative and because convolutional layers contribute the most to depth, we need to be able to do this for the feature maps produced by convolutional layers.
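Before moving on to convolutions, here is what the fully connected case looks like as a minimal NumPy sketch. This is not the code from the repository; the function name, shapes, and random inputs are stand-ins for illustration.

```python
import numpy as np

def top_fc6_features(h6, W7, unit, k=5):
    """Score fc6 features for one fc7 unit and return the top-k indices.

    h6:   fc6 activations for a single image, shape (n_fc6,)  -- condition (1)
    W7:   fc7 weight matrix, shape (n_fc7, n_fc6)              -- condition (2)
    unit: index of the fc7 feature we want to explain
    """
    scores = W7[unit] * h6               # element-wise product of weight row and activations
    return np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring fc6 features

# Toy example with made-up sizes (AlexNet-style fc layers have 4096 units).
rng = np.random.default_rng(0)
h6 = np.maximum(rng.standard_normal(4096), 0)   # ReLU-like activations
W7 = rng.standard_normal((4096, 4096))
print(top_fc6_features(h6, W7, unit=7))
```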
Fortunately, the extension is simple and I’ll show it using conv5 instead of fc7 and conv4 instead of fc6. Instead of computing the gradient of a single activation we’ll define a function and compute the gradient of that function. Specifically, let $s(\cdot)$ be the sum of all pixels in an image (or feature map). Then we’ll compute

$$\frac{\partial\, s(h_5^{(i)})}{\partial h_4}$$

where $h_5^{(i)}$ is a vector of all the pixels in the $i$-th feature map of conv5 and $h_4$ is a vector of all the pixels in all channels of conv4. Then we can weight conv4 pixels using

$$\frac{\partial\, s(h_5^{(i)})}{\partial h_4} \odot h_4$$

This is a generalization of the first approach, where there is only 1 pixel to be summed up, so $s$ becomes the identity function. That means it also extends to cases where a convolutional layer is connected to a fully-connected layer. For details, see the code here.
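Here is a rough sketch of the convolutional case using PyTorch autograd. It is not the repository code; the toy layer sizes and the per-channel ranking at the end are my own choices for the example.

```python
import torch
import torch.nn as nn

# Toy stand-in for the conv4 -> conv5 portion of a network; a real experiment
# would use a pretrained model, so the channel counts here are invented.
conv4_to_conv5 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())

conv4 = torch.rand(1, 16, 14, 14, requires_grad=True)  # conv4 activations for one image
conv5 = conv4_to_conv5(conv4)

i = 7                       # index of the conv5 feature map we care about
s = conv5[0, i].sum()       # s(.) applied to the i-th feature map: sum of its pixels
s.backward()                # gradient of s w.r.t. every conv4 pixel

# Weight conv4 pixels by gradient * activation, exactly as in the fc case.
weights = (conv4.grad * conv4).detach()[0]     # shape: (channels, height, width)

# One way to pick the most related conv4 feature maps: sum the weights per channel.
per_channel = weights.sum(dim=(1, 2))
print(per_channel.argsort(descending=True)[:5])
```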
Here’s an example which shows the 600th
fc6 neuron in terms of the
conv5 units (20, 154, 247, 159, and 222) which most contribute to it.
Hovering your mouse over the image causes it to zoom.
Before showing off some more examples, I have to mention one
trick I found helpful for creating relevant visualizations.
Visualizations of type 2 applied to convolutional layers usually set all pixels but one (the highest activation) to 0. When visualizing a unit in a convolutional layer, say unit 44 in conv4, with respect to a unit in the layer above, say unit 55 in conv5, I could visualize unit 44 by taking a spatial max and setting all other pixels to 0. However, that max pixel might be in a location far away from the max pixel in unit 55, so those parts of the image might not be related and I might not be able to compare the two neurons. To fix this I compute the gradient of unit 55 w.r.t. all units in conv4, then set the gradient in every conv4 feature map to 0 except unit 44’s. This means a local region of pixels is visualized in unit 44 instead of a single pixel, but that local region is guaranteed to be centered around the highly activated pixel in unit 55, so I can relate the two units².
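As a sketch of that trick, here is a PyTorch version with a toy two-stage network. It is not the repository code; the layer shapes are invented, and units 44 and 55 are just the numbers from the example above.

```python
import torch
import torch.nn as nn

# Toy stand-ins for image -> conv4 and conv4 -> conv5; real shapes would come
# from a pretrained network.
image_to_conv4 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
conv4_to_conv5 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())

image = torch.rand(1, 3, 56, 56, requires_grad=True)
conv4 = image_to_conv4(image)
conv5 = conv4_to_conv5(conv4)

parent, child = 55, 44                      # conv5 unit 55, conv4 unit 44

# Gradient of conv5 unit 55's most active pixel w.r.t. all of conv4.
target = conv5[0, parent].max()
grad_conv4, = torch.autograd.grad(target, conv4, retain_graph=True)

# The trick: zero the gradient in every conv4 feature map except unit 44's,
# then keep backpropagating the masked gradient down to the image.
masked = torch.zeros_like(grad_conv4)
masked[0, child] = grad_conv4[0, child]
grad_image, = torch.autograd.grad(conv4, image, grad_outputs=masked)

print(grad_image.shape)   # same shape as the image; this is the "gradient image"
```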
Why is this a cat?

Well… the highest fc8 activation is unit 284, which means “Siamese cat”. Grumpy Cat turns out to be a Snowshoe Siamese according to Wikipedia and the ILSVRC doesn’t have “Snowshoe Siamese”. Let’s start at the top of the network with the fully connected layers.
Unfortunately, these don’t seem very interpretable, probably because these
layers have no spatial resolution. That means we can’t really relate classes
to pixels. However, the convolutional layers tell a different story.
conv5 starts with high level interpretable features which we understand as cat parts, so maybe we’ll be able to almost relate ConvNet outputs to inputs. The conv5 visualizations rooted higher up in the network are quite saturated, so let’s start with a visualization rooted at fc6. Some units in conv5 are looking for faces. Some units are looking for ears. The visualized units of conv5 which contribute to the fc6 feature seem to look for different parts of the cat which happen to show up in this image.
It becomes easier to tell what’s going on once the patch visualizations are added, but it’s hard to fit all 5 on the same screen, so let’s set aside two of them. These are looking for wheels (they look like eyes) and generic animal faces (lots and lots of conv5 activations look for some sort of animal face).
Now we can focus on more interesting neurons.
conv5_129 is looking for eyes,
conv5_247 seems to be looking
at furry stuff, and
conv5_243 is looking for dog faces.
Those are high level categories that could clearly contribute
to classifying Grumpy Cat as a cat.
fur + face + eye (perhaps tire) + … = cat
Let’s continue to
conv4 to see what makes up a dog face.
Again, we’ll discard two neurons
and keep the rest.
There are lots of face detectors (even in
conv4!), some sort of pattern
that seems to find noses (
conv4_265), and another eye detector.
I didn’t try to find the redundant face and eye detectors; they just seem to occur quite frequently. There’s also a noticeable difference in the amount
of complexity and context around the eyes and faces, with more complexity and
context in deeper layers.
dog faces + nose + eye + … = dog face
I’m still interested in faces, so let’s look at what makes up one of the conv4 face detectors. We’ll throw out 2 neurons and keep the rest.
We got rid of a simple eye detector and a vertical white line / face detector and kept some more interesting features. conv3_374 looks for fur and eyes together, conv3_240 looks for two eyes and a nose, and another unit looks for a snout with a pink thing (tongue) underneath it.

face + eye + conv3_204 + … = dog face

Features like conv3_204 are somewhat well defined parts, but it’s hard to name them.
We’ll continue with conv2.
The left neuron (
conv2_13) sort of looks like it activated
for the cat’s cheek, but I can’t quite tell what the right one does.
These are clearly lower level than
conv3, where there was still
a simple eye detector and a simple face detector.
I can see how conv2_21 and another unit like it might be combined to create an eye detector, where one looks for the insides of eyes and the other looks for outlines. One of these might also be used for noses, but it’s not a very strong connection.
dark black rectangular blob (nose) + circle outline (eye part 1) + small dot (eye part 2) + … = 2 eyes + nose
Finally, we visualize
conv1, which looks for straightforward motifs
like the classic Gabor filters.
circular blob + black and white grid + colored edge + … = small dot
The visualizations sort of answered why the CNN called the image a cat. The concepts represented by a neuron’s visualization and that of its parent typically fit together well. You can see how one is composed of the other. However, the visualizations, our capacity to understand them, or the network breaks down at some point.
Another interesting observation is the amount of redundancy between layers. For example, multiple layers seem to contain more than one face detector. It could be that the visualizations aren’t good enough to distinguish between features that really are different, that the network really learns redundant features, or that humans could do a better job at seeing the differences. Some of my recent work suggests the networks do learn significantly redundant features, so the models could be at fault.
In this post I explored how we understand ConvNets through visualization. We tend to focus on individual neurons, but not on how they relate to other neurons. The latter part is essential because ConvNets are supposed to work by learning to compose complex features from simpler ones. The second part of the post proposed a way to combine existing visualizations so that they relate units across layers in a neural network. These visualizations offer some insight into relations between features in adjacent layers, but this is only a small step toward a better understanding of ConvNets.
1. It looks like parallel coordinates, but it’s not really. The axes correspond to layers of a CNN and the values on each axis correspond to individual neurons. Placing neurons on an axis implies an order (e.g., neuron 5 is greater than neuron 88), but no such ordering of neurons in the same layer exists.
2. This tends to saturate visualizations, especially as they grow further from the root activation. I couldn’t find a simple rule that consistently prevented saturation, but I found the saturated versions still work well enough. Let me know if you come up with something.