Hi, and welcome to today's lecture. In the last part, we saw how we can use neural networks in statistical machine translation. Using these techniques, it was possible to improve the quality of statistical machine translation. Since neural networks were so successful in statistical machine translation, people started asking: can we build a machine translation system that is purely based on neural networks? And these systems are what we want to talk about today. Most commonly, the term "neural machine translation" is used for this. So if we are talking about neural machine translation, we are talking about machine translation systems that are built purely out of neural networks.

In general, machine translation can be viewed as a sequence-to-sequence task: we get one input sequence, which is the source sentence, and we want to generate one output sequence, which is the target sentence. Therefore, these models are often also referred to as sequence-to-sequence models. While in statistical machine translation it was successful to go from word-based to phrase-based models, in neural machine translation we will now again first start with word-based models. That means we generate the translation word by word.

One main advantage of neural machine translation is that it can be trained end to end. That means that all parameters of the model are trained together, and they are all trained in a way that makes the final performance optimal. If we compare this to phrase-based machine translation, the picture is different: there, we first train our alignment, then we take the alignment and do the phrase extraction, and then, based on the extracted phrases, we train our log-linear model. But the alignment is trained independently of the log-linear model at the end, so it is very hard to train an alignment that is optimal for the final model. In contrast, in neural machine translation all parameters of the network, which are all the weights of the network, are trained jointly, and so they are all trained in a way that will in the end give the best machine translation performance.

For neural machine translation, we will start with the encoder-decoder model. But before we take a look at the encoder-decoder model, let us have a brief review of the RNN language model. Here on the right side we see a picture of an RNN language model. In a language model, we want to predict the probability of a word given its history. So, for example, we want to predict the probability of the word "an" given its history; in our case, the history is "sentence start, this, is". If we use an RNN, we have the advantage that we can predict the probability of the word given the whole history, because the whole history is encoded in the hidden state of the RNN.

The RNN works in the following way. We always put in the last word; in the beginning, we put in the sentence start. Then, based on the last word and the last hidden state, we calculate a new hidden state. In this hidden state we encode the whole history, and based on the hidden state we calculate the probability of the next word. So how do we calculate the probability of the word "an"? We initialize our RNN, and the first word we put in is the sentence start. Then, in the next time step, we put in the word "this" and combine the previous hidden state with this input into a new hidden state. In the next time step, we put in the word "is" and again generate a new hidden state. In this hidden state we have now encoded the whole history "sentence start, this, is", and based on this hidden state we predict how probable the next word "an" is. So it always works like this: we input a word, calculate the new hidden state, and predict the next word; then we input this word, calculate a new hidden state, and so on. Continuing in this way, we can generate a sequence of words with an RNN language model.
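To make this walkthrough concrete, here is a minimal sketch of such an RNN language model in PyTorch. The vocabulary size, embedding size, hidden size, and the word indices for the sentence-start symbol, "this", and "is" are made-up values for illustration, not details from the lecture.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predicts P(next word | history); the whole history lives in the hidden state."""
    def __init__(self, vocab_size=40000, embed_size=500, hidden_size=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)  # word index -> continuous vector
        self.rnn_cell = nn.RNNCell(embed_size, hidden_size)    # combines last word and last hidden state
        self.output = nn.Linear(hidden_size, vocab_size)       # one score per vocabulary word

    def step(self, word_id, hidden):
        emb = self.embedding(word_id)                       # (batch, embed_size)
        hidden = self.rnn_cell(emb, hidden)                 # new hidden state encodes the whole history
        probs = torch.softmax(self.output(hidden), dim=-1)  # P(next word | history)
        return probs, hidden

# Hypothetical indices for "<s> this is"; after the loop, probs is the distribution
# over the next word, e.g. how probable "an" is given this history.
model = RNNLanguageModel()
hidden = torch.zeros(1, 1000)
for idx in [0, 5, 7]:                                       # <s>, this, is (assumed indices)
    probs, hidden = model.step(torch.tensor([idx]), hidden)
```

Looking up the index of "an" in `probs` would give exactly the probability discussed above.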
Now it is quite straightforward to extend this framework into a sequence-to-sequence model that uses the encoder-decoder framework. This encoder-decoder framework is shown here on the right side. The main difference between the language modeling task and the MT task is that we do not want to predict the next word based only on the previous target words; we also want to predict it based on all the words in the source sentence. Therefore, we have to extend our model in a way that the hidden state not only encodes the history on the target side, but also encodes all the information about the source sentence.

This is done in the following way. We start with our source sentence, and the source sentence is encoded by the encoder into a sentence representation. So we want to get a representation of the whole source sentence, and for this we can again just use our RNN and take the last hidden state. Using this representation, we can then initialize the hidden state of the decoder. Since the decoder is now initialized with the source sentence representation, the hidden state no longer depends only on the previous target words, but also on the whole source sentence, and therefore we can now generate a target sentence that is a translation of the source sentence. So the two main components are the encoder and the decoder. In the encoder, we read in the source sentence and generate a sentence representation. Then we have the decoder, which generates the target sentence word by word using the sentence representation of the encoder.

Let us look at these parts in detail. As mentioned, the main task of the encoder is to read in the source sentence and generate a fixed-size representation of it. In this representation, all the content of the source sentence that is necessary to generate the target sentence should be encoded. The advantage is that we can do this very similarly to our RNN language model. One example is given here on the left, where we have marked the encoder in blue. As we did for the RNN language model, we read in the source sentence word by word. The only difference is that we do not need to predict any words, so we can ignore the output layer. We are only interested in the hidden states, because these hidden states, as we learned for the language model, encode the whole history. So at the end, we have a hidden state that encodes the whole source sentence. As the RNN, we often use a special model called long short-term memory, the LSTM. The result of the encoder is then this fixed-size representation at the end.

Here it is very important to emphasize that this is a fixed-size representation. We need a fixed-size representation because in the RNN we cannot have hidden states with different dimensions. But this also leads to a problem: no matter how long the sentence is, we always have to store all the information in this fixed-size representation. So if this size gets too small, we will lose some information.
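As a sketch of what the encoder computes, the snippet below reads a source sentence with an LSTM and keeps only the final hidden state as the fixed-size sentence representation. The layer sizes and the tensor of source word indices are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence; the last hidden state is the fixed-size representation."""
    def __init__(self, src_vocab_size=40000, embed_size=500, hidden_size=1000):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

    def forward(self, src_ids):
        emb = self.embedding(src_ids)           # (batch, src_len, embed_size)
        _outputs, (h_n, c_n) = self.lstm(emb)   # the per-word outputs are ignored
        return h_n, c_n                         # same size no matter how long the sentence is

# A hypothetical source sentence of six word indices; the representation would have
# exactly the same shape for a sixty-word sentence -- this is the fixed-size bottleneck.
encoder = Encoder()
src_ids = torch.tensor([[3, 17, 52, 8, 99, 2]])
h_n, c_n = encoder(src_ids)                     # each of shape (1, 1, 1000)
```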
Before we look at the decoder, let us take a detailed look at what each of these time steps looks like. Let us look in detail at the time step where we input the English word "an". How do we input this word into our encoder? As in all the other neural network models, we first map each word to an integer. Then we use the so-called one-hot encoding, where we have a vector whose dimension is the vocabulary size; exactly one value is one, namely at the position of the index of this word, and all the other values are zero. So in our case, we have as input a vector where the position of the index of the word "an" is a one, and all the other values are zero. Then we have the first layer of our network, a so-called word embedding layer, which maps this representation into a continuous representation with a much smaller dimension. Typically, the input dimension, the vocabulary size, will be something around forty to eighty thousand words, and we map it into a word embedding with a size of typically around five hundred to one thousand. In contrast to the one-hot encoding, we no longer have a binary vector, but a continuous vector of real values. Then we have our recurrent layer, where we input the word embedding and, based on the previous state, generate the new state of the RNN. We do this for all time steps, and finally the hidden state at the last time step is our sentence representation.

After looking at the encoder, let us now turn to the decoder. In the decoder, we want to generate the target sequence, that is, we want to generate the target words one by one. At first view, it looks exactly like our language model. The only difference is that our RNN is now initialized with the hidden state of the encoder. So at the start, we put in the sentence-start symbol, but then we combine this input with the output of the encoder to get the new hidden state, and based on this hidden state, we predict the first target word. So all the information that flows from the encoder to the decoder flows through this position, where we calculate the first hidden state of the decoder. Again, we mostly use an LSTM-based RNN model. And once we have generated a target word, we always input it in the next time step as the previous target word. So this works for the decoder exactly as it did for the RNN language model.

Let us have a detailed look at one time step, for example the time step where the German word "Einstein" is input. Here the first three layers look exactly as in the encoder: first we represent the word in the one-hot representation, then we calculate the word embedding, and the word embedding is the input to the RNN layer, which calculates the new hidden state based on the word embedding and the previous hidden state. So up to the hidden state, it is exactly the same as in the encoder. But now, in contrast to the encoder, we also want to predict the next target word, because we want to generate a sequence and are no longer only interested in the hidden state. So we have one additional layer where we predict the probabilities for the next target word. This is a softmax layer whose dimension is again the vocabulary size, and for every word we calculate the probability of this word given the history on the target side and the source sentence.
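Along the same lines, here is a hedged sketch of such a decoder. It is initialized with the encoder's state and feeds each generated word back in as the next input. The layer sizes, the start and end symbol indices, the greedy choice of the most probable word at each step, and the zero tensor standing in for the encoder state from the previous sketch are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Generates the target sentence word by word; the source enters only via the initial state."""
    def __init__(self, tgt_vocab_size=40000, embed_size=500, hidden_size=1000):
        super().__init__()
        self.embedding = nn.Embedding(tgt_vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, tgt_vocab_size)    # scores per target word (softmax below)

    def generate(self, enc_state, bos_id=1, eos_id=2, max_len=20):
        state = enc_state                 # initialize with the encoder's sentence representation
        word = torch.tensor([[bos_id]])   # start with the sentence-start symbol
        result = []
        for _ in range(max_len):
            emb = self.embedding(word)                             # word index -> word embedding
            out, state = self.lstm(emb, state)                     # new hidden state
            probs = torch.softmax(self.output(out[:, -1]), dim=-1) # distribution over the vocabulary
            word = probs.argmax(dim=-1, keepdim=True)              # greedy pick of the next target word
            if word.item() == eos_id:                              # stop at the sentence-end symbol
                break
            result.append(word.item())
        return result

# Zero tensors stand in for the encoder's (h_n, c_n) from the previous sketch.
decoder = Decoder()
target_ids = decoder.generate((torch.zeros(1, 1, 1000), torch.zeros(1, 1, 1000)))
```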
So now you have gotten to know the encoder and the decoder, and the nice thing about the encoder-decoder framework is that you already know all the important parts of this network. This directly leads to one of the main advantages of the encoder-decoder framework: its simplicity. If we remember how an SMT system works, there are quite a lot of different components, and you have to use all of them in order to build a full SMT system. Normally we start with the word alignment, then we have to do the phrase extraction, then the phrase scoring, and we also have to build a language model. Then all these components have to be combined in the log-linear model, which altogether makes it a very complicated system. In contrast, in the encoder-decoder framework we have one model that gets the parallel data as input, and then we can train the whole model. We do not have different components like word alignment or phrase extraction; it is one big framework that can be used to generate the target sequence.

Furthermore, as I already mentioned in the beginning, one advantage is that it can be trained end to end, meaning that all parameters of the model are trained jointly. It is no longer the case that you improve your word alignment and are still not sure whether this really improves your overall translation performance, because now all components are trained together, and they are all trained in a way that the final performance should be the best.
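To illustrate what training end to end means here, the following sketch wires an encoder and a decoder together and trains them with a single cross-entropy loss on the target words, so one backward pass updates all weights jointly. The toy sentence pair, the word indices, the layer sizes, and the use of teacher forcing (feeding the reference target words during training) are assumptions of this sketch rather than details from the lecture.

```python
import torch
import torch.nn as nn

V, E, H = 40000, 500, 1000                      # vocabulary, embedding, hidden sizes (assumed)

# Encoder and decoder components; together they form one jointly trained model.
src_emb, enc_lstm = nn.Embedding(V, E), nn.LSTM(E, H, batch_first=True)
tgt_emb, dec_lstm = nn.Embedding(V, E), nn.LSTM(E, H, batch_first=True)
dec_out = nn.Linear(H, V)

params = (list(src_emb.parameters()) + list(enc_lstm.parameters()) +
          list(tgt_emb.parameters()) + list(dec_lstm.parameters()) + list(dec_out.parameters()))
optimizer = torch.optim.Adam(params)            # one optimizer over *all* weights
criterion = nn.CrossEntropyLoss()

# One hypothetical parallel sentence pair as word indices (teacher forcing on the target side).
src_ids = torch.tensor([[3, 17, 52, 8, 99, 2]])
tgt_in  = torch.tensor([[1, 24, 66, 9]])        # <s> w1 w2 w3
tgt_out = torch.tensor([[24, 66, 9, 2]])        # w1 w2 w3 </s>

_, enc_state = enc_lstm(src_emb(src_ids))               # fixed-size source representation
dec_hidden, _ = dec_lstm(tgt_emb(tgt_in), enc_state)    # decoder conditioned on the source
logits = dec_out(dec_hidden)                            # (1, tgt_len, V)

loss = criterion(logits.reshape(-1, V), tgt_out.reshape(-1))
optimizer.zero_grad()
loss.backward()                                 # gradients flow into decoder *and* encoder weights
optimizer.step()                                # a single objective updates every parameter jointly
```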
The main disadvantage of the encoder-decoder framework is the bottleneck between the encoder and the decoder. As mentioned before, we use a fixed-size representation to encode the source sentence, and then we use this fixed-size representation to initialize our decoder. The problem is that the size of this representation is fixed and thereby independent of the length of the source sentence. If we have a short sentence, that is no problem: we can encode all the necessary information in this hidden state and then generate the correct translation. So in the initial experiments, the performance of these systems was very good on short sentences. But the longer the sentence gets, the more problems we get, because we can no longer store all the information that is necessary to generate a translation of the whole source sentence in this fixed-size representation. At first, some heuristics were presented to partly overcome this problem. For example, it was somewhat better to put in the source sentence in the reverse direction, so you put in the last word first, then the second-to-last word, and so on up to the first word. But this also did not really help for very long sentences. So if we really want to overcome this problem and be able to encode all the information of the source sentence, we will later need to extend our framework.

To summarize today's lecture: we introduced the encoder-decoder model. As the name already says, the encoder-decoder model consists of first an encoder and then a decoder. The encoder reads in the source sentence using an RNN model and generates a fixed-size representation of the source sentence. Then we have the decoder, which generates the target sentence word by word. In contrast to a language model, this decoder is initialized with the source representation, so it can generate a translation of the source sentence. One