In previous videos, we saw how class activation mapping (CAM) can be applied to produce class-discriminative maps, and we had the opportunity to apply CAM to ECG classification. We noted that we had to modify the network architecture, which in certain cases could compromise its performance. In this video, we are going to discuss gradient-weighted class activation maps (Grad-CAM), a technique that, as we are going to see, generalizes CAM: it can be applied to a broad range of deep neural networks without changing their architecture. We are also going to see how we can combine Grad-CAM with guided backpropagation to deliver high-resolution explanation maps. This diagram, taken from the original publication, summarizes the main intuition behind gradient-weighted class activation maps. The idea is that once we have an input image as well as an output class of interest, for example 'tiger cat', we forward-propagate the image through the CNN part of the model, and then through task-specific computation, to obtain a raw score for the category. Similar to CAM, the gradients are set to zero for all classes except the desired class, and then the signal is backpropagated through the rectified convolutional feature maps. Remember that in CAM we had to introduce a global average pooling layer; the main difference here is that this is not required. Let's look at the mathematical formulation behind gradient-weighted class activation maps. We estimate the gradient of the class score with respect to the feature maps of the last convolutional layer, and this gradient is global-average-pooled over the width and height dimensions, indexed here as i and j: the gradients are summed over i and j and divided by the map size to obtain the neuron importance weight for the specific class under consideration. In other words, Grad-CAM considers not learned weights but the gradients flowing into the last convolutional feature maps; it needs to do this because there is no global average pooling layer introduced artificially, as we did with CAM.
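The neuron importance weights just described can be sketched in a few lines of NumPy. This is a minimal illustration with synthetic gradients; the array shapes and variable names are assumptions for the example, not code from the paper:

```python
import numpy as np

# Hypothetical gradients of the class score y^c with respect to K
# feature maps A^k of shape (K, H, W) from the last convolutional layer.
rng = np.random.default_rng(0)
grads = rng.standard_normal((4, 7, 7))  # K = 4 maps, 7x7 spatial size

# Neuron importance weights alpha_k^c: global average pooling of the
# gradients over the spatial dimensions i, j (the constant Z = H * W).
alpha = grads.mean(axis=(1, 2))  # one weight per feature map, shape (K,)
```

In a real network these gradients would be captured with a backward hook on the last convolutional layer; here they are random placeholders so the pooling step itself is visible.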
We see here that the formulation is very similar to what we had with CAM. In fact, the main difference between the neuron importance weight alpha_k here and the weight w_k we had in CAM is just the proportionality constant Z, which corresponds to the size of the feature map. For a fully convolutional architecture, CAM is a special case of Grad-CAM. Finally, to obtain the Grad-CAM map for a class c, we apply a ReLU to the linear combination of the forward activation maps weighted by these gradient-based neuron importance weights. We see this schematically in the diagram as well. In this way we obtain our Grad-CAM map, which is an explanation of the decision 'tiger cat' for our original picture. Grad-CAM is a generalization of CAM, and its main advantage compared to CAM is that it applies to a large family of CNNs: CNNs with fully connected layers, CNNs with structured outputs, and even models for tasks with multimodal inputs. Therefore we see Grad-CAM applied to off-the-shelf CNN-based architectures for image captioning and visual question answering. Also, similar to CAM, Grad-CAM does not require retraining the neural network. Here we see an example of the results of Grad-CAM and how they compare with guided backpropagation. On the top, we see Grad-CAM with respect to the output class 'Cat'. On the bottom, you see the Grad-CAM activation with respect to the decision 'Dog'. We see that Grad-CAM, similar to CAM, is class-discriminative; however, it does not provide the high resolution that guided backpropagation is able to give us. Grad-CAM can be combined with guided backpropagation, simply by taking the element-wise product of the two outputs, to derive class-discriminative, high-resolution explanation maps. This method is called Guided Grad-CAM. We see here another clear example of why Guided Grad-CAM offers both high-resolution maps and class-discriminative explanations. On the top, we now see the element-wise product of the Grad-CAM map with guided backpropagation.
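The two steps above, the ReLU-weighted combination and the element-wise product with a guided-backpropagation map, can be sketched as follows. All inputs here are synthetic placeholders (in practice the activations and gradients come from the network and the guided-backpropagation map from a separate backward pass), and the nearest-neighbour upsampling is only an assumption to keep the sketch dependency-free:

```python
import numpy as np

def grad_cam(activations, grads):
    """Grad-CAM: ReLU of the importance-weighted sum of feature maps.
    activations, grads: arrays of shape (K, H, W)."""
    alpha = grads.mean(axis=(1, 2))                   # neuron importance weights
    cam = np.einsum('k,khw->hw', alpha, activations)  # linear combination over K
    return np.maximum(cam, 0)                         # ReLU: keep positive influence

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 7, 7))   # hypothetical last-layer activations
G = rng.standard_normal((8, 7, 7))   # hypothetical class-score gradients
cam = grad_cam(A, G)                 # coarse 7x7 class-discriminative map

# Guided Grad-CAM: upsample the coarse map to the input resolution and
# take the element-wise product with a guided-backpropagation map.
gb = rng.standard_normal((224, 224))       # hypothetical guided backprop map
cam_up = np.kron(cam, np.ones((32, 32)))   # naive 7x7 -> 224x224 upsampling
guided_grad_cam = gb * cam_up              # high-res, class-discriminative
```

The element-wise product is what makes the result both sharp (from guided backpropagation) and class-discriminative (from the Grad-CAM mask).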
It is very clear not only that the network is focusing on the 'Cat', but also which low-level features are represented in this part of the picture. The same holds on the bottom: in the map based on the Grad-CAM activation for the dog, the dog features are very clear and very focused. Grad-CAM became very popular because it is relatively simple to implement, even with complex deep neural network architectures. For this reason, several variants of Grad-CAM have emerged. One notable variant is Grad-CAM++. I show you here the main diagram that summarizes the concept behind Grad-CAM++ and also explains very clearly its differences from Grad-CAM and CAM; it is taken from the original paper. In Grad-CAM++, a weighted combination of the positive partial derivatives of the last convolutional layer's feature maps is taken with respect to a specific class score, and these weights are used to generate the visual explanation. If you remember, Grad-CAM, on the other hand, estimates its weights by averaging the gradients, dividing by a constant Z that corresponds to the size of the feature map; when the relevant response is spread over many regions, this averaging can make the weights smaller. The authors of Grad-CAM++ found an intuitive way to increase the sensitivity of Grad-CAM, and in this way they are able to explain multiple instances of an object in an image. I show you here an example where we see several instances of similar objects: Grad-CAM ignores some of them, whereas Grad-CAM++ covers better the areas that relate to the specific decision of the deep neural network. This becomes even clearer when we combine Grad-CAM++ with guided backpropagation to obtain Guided Grad-CAM++. Summarizing, in the last few videos we reviewed several model-specific explainability methods for deep neural networks.
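The Grad-CAM++ weighting can be sketched as below, using the closed-form approximation from the Grad-CAM++ paper in which, for an exponential class score, the higher-order derivatives reduce to powers of the first-order gradient. The shapes, names, and the small epsilon are assumptions of this sketch:

```python
import numpy as np

def grad_cam_pp_weights(activations, grads, eps=1e-8):
    """Grad-CAM++ weights (sketch): a pixel-wise weighted combination of
    the positive partial derivatives, rather than a plain spatial average
    as in Grad-CAM. activations, grads: arrays of shape (K, H, W)."""
    g2, g3 = grads ** 2, grads ** 3
    spatial_sum = activations.sum(axis=(1, 2), keepdims=True)
    # Pixel-wise weighting coefficients alpha_ij (eps avoids division by zero)
    coeff = g2 / (2.0 * g2 + spatial_sum * g3 + eps)
    # Only positive gradients contribute (ReLU on the derivatives)
    return (coeff * np.maximum(grads, 0)).sum(axis=(1, 2))

rng = np.random.default_rng(2)
A = np.abs(rng.standard_normal((8, 7, 7)))  # hypothetical activations
G = rng.standard_normal((8, 7, 7))          # hypothetical gradients
w = grad_cam_pp_weights(A, G)               # one weight per feature map
```

Because each spatial location gets its own coefficient instead of a uniform 1/Z, regions belonging to additional object instances retain their influence on the final map.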
We also saw the difference between high-resolution visualization methods based on backpropagation techniques and class-discriminative methods based on class activation maps, which take into consideration the high-level features of the last layer of the deep neural network along with class-specific activations. In this family, we demonstrated how Grad-CAM is a generalization of CAM that allows more flexibility with respect to the network architecture. Some evidence shows that Grad-CAM can provide better results in terms of coverage and explainability than CAM, and combined with guided backpropagation it provides class-discriminative, high-resolution maps. We also saw that certain variants, such as Grad-CAM++, can significantly improve the sensitivity of the method.