Now that we've looked at some advanced computer vision techniques, there's one more area that you should look at: visualization and interpretability of what computers see when they're classifying images or detecting objects within them. This is a huge area that could take several courses in its own right, but this week we'll look at some techniques for it, and we'll explore a case study where they were used successfully to improve the accuracy of some state-of-the-art models.

There's an old story, somewhat legendary, so it's not clear if it's 100 percent true or not, but it's great for illustrating the point. It dates back many years, and it comes from an experiment where the US Army wanted to build a device that could detect a camouflaged tank. The team tasked with building this detector needed to train a model with data for tanks that were camouflaged and tanks that were not. They were able to borrow a tank for a few days, drive it around the countryside, and take lots of pictures. Later, they got the camouflage netting and did the same thing, taking lots of pictures of the camouflaged tank. They trained a model, and it appeared to give great accuracy. They were excited that they might have trained something new and very, very cool. But then they took it into the field to test it. When trying to detect hidden tanks, it failed completely. It had worked amazingly well with their training, testing, and validation data, but they were at a loss for why it was failing in the real world. So they had to start over. They looked over all of their images time and time again, and they couldn't see anything that could have caused the problem. They had huge coverage of different angles of the tank with and without the camouflage netting. Then they noticed that on the days when they'd had the camo netting for the tank, the weather was cloudy. On the days that they didn't, the weather was sunny.
Instead of building a camouflaged tank detector, they had instead built a cloudy sky detector. Now, this might not be a true story, or the details might have been lost over the years, but it is a funny illustration of the point. When we use a convolutional neural network to extract features from an image and then classify based on those extracted features, the process of filtering the image and then pooling it means that a lot of interpretable information can be lost. It can become hard for us to understand exactly what the network sees when it does the classification. In the case of the tank story, it took people looking at the original images to see a distinction, and in that case the original images had a clear delineating factor that they could figure out. But what if in your case you don't have that, and it's not as obvious as a cloudy or sunny sky? What methods can you use to understand how the computer sees your images?

We'll look at a few of them this week, starting with a simple example to illustrate a methodology for figuring out what the computer picks out as the relevant details in images and then visualizing them. For example, imagine a simple classifier for Fashion MNIST, where we use four layers of convolutions, each with an associated pooling layer, before we feed the result into a dense layer for classification. What happens to the image as it travels through this network? Well, we can see a little of that as we explore the model.summary output in Keras. For the Fashion MNIST model that we see here, which you'll use in a Colab shortly, the model layout looks like this: we have the four convolutional layers, and these have 16, 32, 64, and 128 filters respectively. We also have max pooling layers after each of the first three convolutions. Recall that max pooling keeps the highest-valued pixel in the pool, so it will effectively ignore the pixels that are not the maximum.
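A model along these lines could be sketched as follows. The filter counts (16, 32, 64, 128) and the max pooling after the first three convolutions come from the description above; the kernel sizes, activations, the global average pooling before the head, and the 10-class softmax are assumptions for illustration:

```python
import tensorflow as tf

# Sketch of the Fashion MNIST classifier described above. Filter counts are
# from the text; kernel size, activations, and the head are assumptions.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),         # Fashion MNIST: 28x28 grayscale
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),            # 28x28 -> 14x14
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),            # 14x14 -> 7x7
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),            # 7x7 -> 3x3 (rounds down)
    tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),  # average over each 3x3 map
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.summary()
```

Running model.summary() on this prints the layer-by-layer table discussed here, with the output shapes shrinking from 28 by 28 down to 3 by 3 as the image moves through the pooling layers.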
After the last convolutional layer, I decided to do an average pooling instead, so that some of the lesser-intensity pixels, which are not the maximum, would be passed on by the pooling layer. In particular, given the small size of the image once it reaches this layer, max pooling could leave us with very little information. Let's explore the journey of the image as it goes through the layers. Fashion MNIST images are 28 by 28, and we can see that here. The 16 is the number of filters in this convolutional layer. I used padding='same' on the convolutional layers to keep the image 28 by 28 after the filters are passed over it; typically, a three by three filter would remove a one-pixel border, giving you a 26 by 26 output. The two by two max pool will then halve the resolution of the image on each axis, so 28 by 28 becomes 14 by 14. Similarly, the next max pool will make it seven by seven, and the next one will take it down to three by three, rounding down. By the time the image comes out of the final convolutional layer, there'll be many three by three images, and these will get fed into the dense layer for classification. It would be nice for us to take the output of the final convolutional layer and look at the classification for it, so that we could figure out how the values from this layer end up determining the class. When using a sequential model, we can access the layers by index to find their names, counting backward from the end with negative indices. For example, the final dense layer is at index position minus one, and conv2d_3 is at minus three. This is a handy way of finding the index of a layer, which you'll see how to use in a moment. Using the functional API, we can create a new model with the same input as the previous model, but with two outputs: one is the final conv2d at index minus three, and one is our dense output at index minus one. Now I can call the model and get back two responses.
The first response, the features, is the output from the conv2d layer, the one at index position minus three. The second is the output from the dense layer, and that's my classifications, from index position minus one. Now you have the information that you need to calculate what's called a Class Activation Map, which you'll see next.
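A minimal sketch of that two-output model using the functional API might look like this. The base classifier is rebuilt here so the snippet is self-contained, with the same illustrative assumptions as before about kernel sizes, activations, and the head; the random images are just a stand-in batch:

```python
import numpy as np
import tensorflow as tf

# Rebuild the classifier from the text: four convs (16/32/64/128) with
# max pooling after the first three, then average pooling before the
# dense head (details other than the filter counts are assumptions).
layer_list = [tf.keras.Input(shape=(28, 28, 1))]
for filters in (16, 32, 64):
    layer_list += [
        tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
    ]
layer_list += [
    tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
]
model = tf.keras.Sequential(layer_list)

# Functional API: same input as the original model, but two outputs --
# the conv feature maps at index -3 and the classifications at index -1.
cam_model = tf.keras.Model(
    inputs=model.input,
    outputs=[model.layers[-3].output, model.layers[-1].output],
)

images = np.random.rand(4, 28, 28, 1).astype('float32')  # stand-in batch
features, classifications = cam_model(images)
print(features.shape)         # (4, 3, 3, 128)
print(classifications.shape)  # (4, 10)
```

Calling cam_model on a batch returns both tensors at once: the 3 by 3 by 128 feature maps from the final convolution, and the ten class scores from the dense layer, which is exactly the pairing needed for a Class Activation Map.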