Attention is a hot research area in deep learning. It is inspired by how the human brain works, and it is a powerful mechanism. Here we are going to explore how to add attention to our deep neural networks. In this way, it is possible to enhance performance and also provide local explanations for the network's decisions. In the context of neural networks, attention is a technique that mimics human perception: it enhances the important parts of the input data and fades out the rest. Consider, for example, the scenario in which you attend a talk. It is quite likely that you will focus on the speaker and ignore the surroundings. In other words, we tend to concentrate selectively on a part of the information when and where it is needed, and we ignore a large amount of information at the same time. In this particular example, we do not observe the whole scene; instead, we observe and pay attention only to specific parts, and this applies to all our senses. This is a means for us to quickly select high-value information from a massive stream using limited processing resources. Attention in humans is driven by two mechanisms. Bottom-up attention, also called saliency-based attention, is unconscious and driven by external stimuli. We saw that deep neural networks display this property to some extent by detecting salient features; remember, for example, convolutional neural networks. Humans also depend on top-down, conscious attention, also called focused attention. Focused attention has a predetermined purpose and relies on a specific task; it enables humans to focus on a specific object consciously. Attention is very important because, when computing power is limited, it allows the more important information to be processed with the available computational resources. Here we are going to see how to add attention to a recurrent neural network.
We see here a standard recurrent layer that we have discussed before. Given a sequence, for example of previous words, this layer can predict the next one. The input is the sequence of vectors x1, x2, x3, ..., xn. It is fed to the layer one step at a time, and the layer updates its hidden state at each step; the hidden states are denoted here with h. The input sequence could be, for example, the embeddings of the input words, an ECG signal, or the hidden states from a previous layer. The output of the recurrent layer is a vector with the same length as the number of units in the recurrent layer. This can be fed to a dense layer with a softmax output to predict the distribution for the next element in the sequence we are observing, for example the next word or the next ECG beat.

Now let us see an example of how to add attention to a recurrent neural network. We again have a standard recurrent layer with hidden states h1, h2, h3, ..., hn. Each is a vector of length equal to the number of units in the recurrent layer, and they are passed through a dense layer, typically called an alignment function, to generate the score vector e. This is a scoring step over our hidden states, and it is typically combined with a hyperbolic tangent (tanh) activation. Subsequently, the softmax function is applied to the vector e to produce the vector of weights a, also called importance scores or attention scores. The alignment is, in other words, a learnable function, and each weight a_j reflects the importance of hidden state h_j, that is, the amount of attention the output should pay to that input step. Each hidden state is then multiplied by its respective weight alpha_j, and the results are summed to give the context vector. The context vector has the same length as the hidden state vector, and it represents the relationship between the current output and the entire input sequence.
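The steps just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's exact implementation: the array shapes and the weight names W and v are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_context(H, W, v):
    """Additive attention over RNN hidden states.
    H: (n_steps, units) hidden states h_1..h_n
    W: (units, units) dense (alignment) layer weights -- illustrative
    v: (units,) projection of each tanh output to a scalar score
    """
    e = np.tanh(H @ W) @ v   # energy scores e_j, one per time step
    a = softmax(e)           # attention weights alpha_j, sum to 1
    c = a @ H                # context vector: weighted sum of hidden states
    return c, a

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))  # 4 time steps, 8 recurrent units
W = rng.normal(size=(8, 8))
v = rng.normal(size=8)
c, a = attention_context(H, W, v)
print(c.shape, round(a.sum(), 6))  # context has hidden-state length; weights sum to 1
```

Note how the context vector c has the same length as each hidden state, as stated above, and the weights a form a probability distribution over the time steps.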
The context vector is typically passed to a dense layer with a softmax output, which produces a distribution over the potential next state or word. So we have seen that the attention mechanism is powerful: it helps the network decide which previous states of the recurrent layer are important for predicting the next step in a sequence. Encoder-decoder architectures are very powerful in deep learning, and for this reason they have been used in several applications, including natural language processing, computer vision, and healthcare informatics. Here we are going to see how to build attention into an encoder-decoder network, in particular a recurrent neural network encoder and decoder. The attention layer, as we see here, is placed after the encoder. The architecture is very similar to what we have seen previously, with one key difference: the hidden state of the decoder is also involved in the attention mechanism. So the model is able to decide where to focus based not only on the previous encoder hidden states but also on the current decoder hidden state. There are many copies of the attention mechanism within the encoder-decoder network, but they all share the same weights, so there is no extra overhead in the number of parameters to be learned. The context vector c is concatenated with the incoming data y to form an extended input vector for each cell of the decoder. Thus, we treat the context vector as additional data fed into the decoder.

There is extensive research on how to improve the attention mechanism. Here we see a generalized model of attention based on a recent review article. In an attention network, we first encode the source data features as K, also called keys. Keys can take various representations according to the specific task and neural architecture.
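One decoder step of this encoder-decoder attention can be sketched as follows. This is a hedged sketch in the style of additive (Bahdanau-type) attention, which matches the description above; the weight names Wa, Ua, va and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def decoder_attention_input(H_enc, s_prev, y_t, Wa, Ua, va):
    """One decoder step of encoder-decoder attention.
    H_enc: (n_steps, units) encoder hidden states
    s_prev: (units,) previous decoder hidden state
    y_t: (emb,) incoming decoder data (e.g. previous output embedding)
    Wa, Ua: (units, units) score-function weights; va: (units,)
    The same Wa, Ua, va are reused at every decoder step, so attention
    adds no extra parameters per step.
    """
    # Score each encoder state against the current decoder state.
    e = np.tanh(H_enc @ Wa + s_prev @ Ua) @ va
    a = softmax(e)
    c = a @ H_enc                        # context vector
    # Concatenate the context with the incoming data y_t, as described above.
    return np.concatenate([c, y_t]), a

rng = np.random.default_rng(1)
H_enc = rng.normal(size=(5, 8))          # 5 encoder steps, 8 units
s_prev = rng.normal(size=8)
y_t = rng.normal(size=4)                 # 4-dimensional decoder input
Wa, Ua = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
va = rng.normal(size=8)
x_t, a = decoder_attention_input(H_enc, s_prev, y_t, Wa, Ua, va)
print(x_t.shape)                         # extended decoder input: units + emb
```

The extended vector x_t is what would be fed into the decoder cell, exactly as the lecture describes treating the context vector as additional input data.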
Keys can represent, for example, different areas of an image or word embeddings; in our previous recurrent neural network example, they were the hidden states of the network. Usually we also need a task-related representation, which we call the query, such as the previous hidden state of the output. The neural network then computes the correlation between queries and keys through a score function f, also called the energy function. In this way, it estimates the energy scores that reflect the importance of the keys with respect to the query in deciding the next output. Therefore, queries are the set of vectors to calculate attention for, whereas keys are the set of vectors to calculate attention against. The score function f is a crucial part of the attention model because it defines how keys and queries are matched or combined. We have already seen one such f: the alignment function, implemented as a dense layer with a hyperbolic tangent activation. There are less computationally expensive ways of computing the scores, for example the dot product. The distribution function g corresponds to the softmax layer that we saw earlier in our recurrent neural network, and it is used to normalize all the energy scores into a probability distribution. Several g functions have been explored by researchers, because the attention distribution function has a great influence on the computational complexity of the whole attention model. Attention mechanisms are a significant breakthrough in deep learning, and they have been exploited to improve the performance of deep neural networks significantly. They are still under intensive investigation, and they can be adjusted in several aspects, such as the score function, the distribution function, or the way values and attention weights are combined in the network architecture.
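The generalized model above, with its score function f and distribution function g, can be sketched as follows. The function name and the choice of dot-product and scaled dot-product scores are illustrative; the lecture only names the dot product as a cheaper alternative to the dense-layer alignment.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: the distribution function g.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def general_attention(q, K, V, score="dot"):
    """Generalized attention.
    q: (d,) query (what to calculate attention for)
    K: (n, d) keys (what to calculate attention against)
    V: (n, d_v) values to be combined with the attention weights
    """
    if score == "dot":
        e = K @ q                        # f: cheap dot-product energy scores
    elif score == "scaled_dot":
        e = K @ q / np.sqrt(K.shape[1])  # scaled variant, common in practice
    else:
        raise ValueError(f"unknown score function: {score}")
    a = softmax(e)                       # g: normalize scores to a distribution
    return a @ V                         # weighted combination of the values

rng = np.random.default_rng(2)
q = rng.normal(size=8)
K = rng.normal(size=(6, 8))              # 6 source features of dimension 8
V = rng.normal(size=(6, 3))
out = general_attention(q, K, V, score="scaled_dot")
print(out.shape)                         # same dimension as one value vector
```

In the earlier recurrent example, the keys and values were both the hidden states and the query was implicit; this generalized form makes each role explicit.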
As we are going to see in subsequent videos, attention can also be exploited to provide human-understandable explanations.