Hi and welcome to today's presentation about word-based statistical machine translation models. Today, we want to introduce the first statistical model for machine translation. The idea here is that we no longer have handwritten rules for how one word is translated into another, because the translation depends on a lot of different factors, like the context the word is used in and several other things. So it is very hard to decide once and for all how a word is translated, and therefore these models do not model the translation directly, but model a probability: the probability of one word being translated into another word. Since it is not easy to estimate how probable a whole target sentence is given the source sentence, we break the process down to word-based translation. So we go down to every individual word and calculate the probability of how this word should be translated given the source sentence. The very nice thing about this is that we no longer have to write rules by hand; instead, we can learn these models from large amounts of data, and it is often easier to collect this data than to write rules for how words are translated.

In a text, a word can have different translations, and we store these different translations in a lexicon. Here, we see the example of the German word wagen. This word can be translated into several different English words, but not all of these translations have the same probability: some are more probable and others are less probable. That is why we take a large corpus and count the co-occurrences of the German word and each English word. For example, in our corpus the German word wagen and the English word vehicle co-occur 5000 times. From these counts we can then use a maximum likelihood estimation to approximate the translation probability: the German word wagen translates into the English word vehicle with a probability of 0.5, and into the English word car with a probability of 0.3.

The second main component of these statistical machine translation systems is the alignment. The alignment is a mapping between the source and the target words. It is implicitly given by the word-to-word translations, and it is formally defined as a function from the target words to the source words. So in our example, the English word I is aligned to the German word Ich, which gives us the first alignment link. Then the English word visit is aligned to the German word besuche, the English word a is aligned to einen, and friend is aligned to Freund. With these alignment links, we now have the alignment between the German sentence and the English sentence.

With these two basics, the lexicon and the alignment, we can go to the first statistical model, the IBM Model 1. This is a word-based translation model where we break the modeling of the sentence probability down into smaller steps: it models the probability of a target sentence e and an alignment a, given the source sentence f.
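Before we look at the formula, here is a minimal sketch in Python of the lexicon estimation just described: counting co-occurrences in a corpus and turning them into translation probabilities with a maximum likelihood estimate. Only the count for vehicle (5000) and the resulting probabilities 0.5 and 0.3 come from the example above; the other counts are hypothetical values chosen to be consistent with them, and the function name estimate_lexicon is made up for this sketch.

    from collections import defaultdict

    # Co-occurrence counts for the German word "wagen"; only the 5000 for "vehicle"
    # is given in the lecture, the other counts are hypothetical.
    cooccurrence_counts = {
        ("wagen", "vehicle"): 5000,  # given in the example
        ("wagen", "car"): 3000,      # hypothetical, implies t(car | wagen) = 0.3
        ("wagen", "dare"): 2000,     # hypothetical remainder so the probabilities sum to 1
    }

    def estimate_lexicon(counts):
        """Maximum likelihood estimate: t(e | f) = count(f, e) / sum over e' of count(f, e')."""
        totals = defaultdict(float)
        for (f, e), c in counts.items():
            totals[f] += c
        return {(f, e): c / totals[f] for (f, e), c in counts.items()}

    lexicon = estimate_lexicon(cooccurrence_counts)
    print(lexicon[("wagen", "vehicle")])  # 0.5
    print(lexicon[("wagen", "car")])      # 0.3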
Let's look at the formula. On the left side we have the formula, and on the right side again our example. First, let's look at the alignment probability: how many alignments are possible between the English sentence and the German sentence? Every English word can either be aligned to one of the German source words or, if it is not aligned to any word, to the null word, so it can be aligned to the number of source words plus one positions. And since this choice can be made for every target word, we have (lf plus 1) to the power of le possible alignments, where lf is the length of the source sentence and le is the length of the English sentence. In the first IBM model, all alignments have the same probability, so the probability of exactly this alignment is just 1 divided by the number of possible alignments, which we have here on the left side. The second part of the probability is the lexical translation probabilities: we multiply over all target words and, for each target word, take the translation probability of the target word given the aligned source word, which is the source word f at position a of j.

To make it clearer, let's look at how exactly this works for our example sentence. We have, again, our example and the formula, and now we calculate the probability of the English sentence and this alignment given the German sentence. First, we have the normalization constant epsilon, and then we have to look at how many alignments are possible. We have 4 different source words plus the null word, which makes 5, and the number of target words is 4, so the number of possible alignments is 5 to the power of 4, and we get a factor of 1 over 5 to the power of 4. Then, going over all our target words, we multiply in the translation probability of I given Ich, the translation probability of visit given besuche, the translation probability of a given einen, and the translation probability of friend given Freund. In the next step, we can look up in our lexicon the translation probabilities we have written down here and put them into our equation. And then we finally get that the probability of this target sentence and this alignment given the source sentence is 7.68 times 10 to the power of minus 5, times epsilon.
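Here is a minimal sketch in Python of exactly this calculation. The formula is the one just described; the individual lexicon values are hypothetical placeholders (the actual values from the slide are not in the transcript), chosen so that the product reproduces the 7.68 times 10 to the power of minus 5 result from the example, with epsilon set to 1.

    def ibm1_probability(target, source, alignment, t, epsilon=1.0):
        """P(e, a | f) = epsilon / (l_f + 1)^l_e * product over j of t(e_j | f_a(j)).
        alignment[j] is the source position aligned to target word j (0 = null word)."""
        l_f, l_e = len(source), len(target)
        prob = epsilon / (l_f + 1) ** l_e   # uniform probability of any single alignment
        padded_source = ["NULL"] + source   # position 0 is the null word
        for j, e_word in enumerate(target):
            prob *= t[(e_word, padded_source[alignment[j]])]  # lexical translation probability
        return prob

    t = {  # hypothetical lexicon entries
        ("I", "Ich"): 0.5,
        ("visit", "besuche"): 0.4,
        ("a", "einen"): 0.4,
        ("friend", "Freund"): 0.6,
    }
    source = ["Ich", "besuche", "einen", "Freund"]
    target = ["I", "visit", "a", "friend"]
    alignment = [1, 2, 3, 4]  # each English word aligned to the German word in the same position
    print(ibm1_probability(target, source, alignment, t))  # 7.68e-05 (with epsilon = 1)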
So now we have the first step: we know how to calculate the translation probability. But as we said in the introduction, one nice thing about this model is that we are able to train it on data and learn these translation probabilities. So how do we learn them? If we had word-aligned data, where we know for every word which word it is translated into, we could directly count the co-occurrences and would get our statistical lexicon. But the problem is that such data is not available for nearly all languages, and we have to deal with corpora that are only aligned at the sentence level. Therefore, we have the problem of incomplete data, and we have to treat the alignment as a hidden variable. But there is a very nice algorithm for this, the expectation-maximization algorithm, which mainly consists of four steps. In the first step, we initialize our model, and if we do not know anything about the probability distribution, a very good idea is to initialize it with the uniform distribution. That means, in the beginning, every word translates into every other word with the same probability. Then we have the second step, the expectation step, where we apply our model to the data: given the current model, in our case the lexicon with its uniform distribution, we calculate the alignment probabilities. Then we have the maximization step: given these alignments, we re-estimate our model, in our case the lexicon, either from the most probable alignment or as the weighted sum over all possible alignments. Steps two and three are then iterated for several iterations, and the model converges to a good lexicon (a small sketch of this training loop is given below).

With this, we now have our initial model, the IBM Model 1, which we can apply to data and which we can train. But of course, there are still several problems, because we made a lot of approximations to keep this model very simple. The first approximation was that every alignment has the same probability. But if we now look at the example we have here, Ich besuche einen Freund and "a friend I visit", we can generate the same alignment links: again Ich and I, besuche and visit, einen and a, and Freund and friend. So in this case, the probability would be exactly the same as in our initial example, although this translation is a lot worse. Therefore, it makes sense not to give all alignments the same probability, but to say that alignments with such crossing links are less probable. Of course, there are cases where you should have crossing alignments, since languages do not always have the same word order, but in general the word order is mostly monotone, and this is not modeled in the IBM Model 1.

The second example we have here is the German sentence Ich gehe zum Haus and "I go to the house", where both "to" and "the" are aligned to the word zum. For the word zum, it should be quite probable that it is aligned to two words, while most other words are very probably aligned to only one word. Therefore, it also makes sense to model the fertility, which describes how probable it is that a word is aligned to one, two, three, or more words.

A third difficulty we see in the example Ich gehe nicht and "I do not go": in English we need "do not" to express the negation, while in German we can just say nicht. So the word "do" is not really aligned to any of the German words, and we would align it to the null word. But this should also be modeled, because not every word should be aligned to the null word, and this is something the more complex models try to model directly. And finally, we also have the inverse case, where on the source side there is a word which is not translated. In this example, I aligns to Ich, go to gehe, and home to Hause, but the German word nach is not aligned to any word, and of course this should also be modeled. All these difficulties are better handled by more complex models, the IBM Models 2, 3, 4, and even 5, which try to address these problems of the initial model.

So today, we introduced the first statistical model, a word-based model which models the translation probability P of a target sentence e given the source sentence f. These models were introduced in the 90s and were the first statistical approach. Later, more complex models were introduced, but one important thing that was used from these models for a very long time is the word alignment they generate. So one byproduct of these models is that we get an alignment between the source and the target sentence, and this is used in many other statistical models, where we use the IBM models to generate these alignments for our training data.
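Returning to the training procedure described above, here is a minimal sketch in Python of the EM loop for IBM Model 1 on sentence-aligned data, using the weighted-sum variant in which the expected counts sum over all possible alignments (which, for Model 1, factorizes over the target words). The toy corpus, the function name train_ibm1, and the number of iterations are illustrative choices for this sketch, not from the lecture.

    from collections import defaultdict

    def train_ibm1(corpus, iterations=10):
        """corpus: list of (source_words, target_words) sentence pairs."""
        # Step 1: initialize t(e | f) uniformly, so that in the beginning every word
        # translates into every other word with the same probability.
        target_vocab = {e for _, es in corpus for e in es}
        t = defaultdict(lambda: 1.0 / len(target_vocab))
        for _ in range(iterations):
            # Expectation step: collect expected co-occurrence counts under the current model.
            count = defaultdict(float)
            total = defaultdict(float)
            for fs, es in corpus:
                fs = ["NULL"] + fs                   # allow alignment to the null word
                for e in es:
                    norm = sum(t[(e, f)] for f in fs)
                    for f in fs:
                        delta = t[(e, f)] / norm     # expected count of aligning e to f
                        count[(e, f)] += delta
                        total[f] += delta
            # Maximization step: re-estimate the lexicon from the expected counts.
            t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
        return t

    corpus = [
        (["Ich", "besuche", "einen", "Freund"], ["I", "visit", "a", "friend"]),
        (["Ich", "gehe", "zum", "Haus"], ["I", "go", "to", "the", "house"]),
    ]
    t = train_ibm1(corpus)
    print(t[("I", "Ich")])  # after training, "I" is the most probable translation of "Ich"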
Because, as you have seen already, this is not really easy even for humans, and these models are a very nice way to generate these word alignments. [MUSIC]