[SOUND] In this lecture, we are going to talk about how to improve the instantiation of the vector space model. This is a continued discussion of the vector space model. We're going to focus on how to improve the instantiation of this model. In the previous lecture, you have seen that with simple instantiations of the vector space model, we can come up with a simple scoring function that would give us basically an account of how many unique query terms are matched in the document. We also have seen that this function has a problem, as shown on this slide. In particular, if you look at these three documents, they will all get the same score because they match the three unique query words. But intuitively we would like d4 to be ranked above d3, and d2 is really not relevant. So the problem here is that this function couldn't capture the following heuristics. First, we would like to give more credit to d4 because it matched presidential more times than d3. Second, intuitively, matching presidential should be more important than matching about, because about is a very common word that occurs everywhere. It doesn't really carry that much content. So in this lecture, let's see how we can improve the model to solve these two problems. It's worth thinking at this point about why do we have these problems? If we look back at assumptions we have made while instantiating the vector space model, we'll realize that the problem is really coming from some of the assumptions. In particular, it has to do with how we placed the vectors in the vector space. So then naturally, in order to fix these problems, we have to revisit those assumptions. Perhaps we will have to use different ways to instantiate the vector space model. In particular, we have to place the vectors in a different way. So let's see how we can improve this. One natural thought is in order to consider multiple times of a term in the document, we should consider the term frequency instead of just the absence or presence. In order to consider the difference between a document where a query term occurred multiple times and one where the query term occurred just once, we have to consider the term frequency, the count of a term in the document. In the simplest model, we only modeled the presence and absence of a term. We ignored the actual number of times that a term occurs in a document. So let's add this back. So we're going to then represent a document by a vector with term frequency as element. So that is to say, now the elements of both the query vector and the document vector will not be 0 or 1s, but instead they will be the counts of a word in the query or the document. So this would bring in additional information about the document, so this can be seen as more accurate representation of our documents. So now let's see what the formula would look like if we change this representation. So as you'll see on this slide, we still use dot product. And so the formula looks very similar in the form. In fact, it looks identical. But inside the sum, of course, x i and y i are now different. They are now the counts of word i in the query and in the document. Now at this point I also suggest you to pause the lecture for a moment and just to think about how we can interpret the score of this new function. It's doing something very similar to what the simplest VSM is doing. But because of the change of the vector, now the new score has a different interpretation. Can you see the difference? And it has to do with the consideration of multiple occurrences of the same term in a document. More importantly, we would like to know whether this would fix the problems of the simplest vector space model. So let's look at this example again. So suppose we change the vector representation to term frequency vectors. Now let's look at these three documents again. The query vector is the same because all these words occurred exactly once in the query. So the vector is still a 01 vector. And in fact, d2 is also essentially representing the same way because none of these words has been repeated many times. As a result, the score is also the same, still 3. The same is true for d3, and we still have a 3. But d4 would be different, because now presidential occurred twice here. So the ending for presidential in the document vector would be 2 instead of 1. As a result, now the score for d4 is higher. It's a 4 now. So this means by using term frequency, we can now rank d4 above d2 and d3, as we hoped to. So this solved the problem with d4. But we can also see that d2 and d3 are still filtering the same way. They still have identical scores, so it did not fix the problem here. So how can we fix this problem? Intuitively, we would like to give more credit for matching presidential than matching about. But how can we solve the problem in a general way? Is there any way to determine which word should be treated more importantly and which word can be basically ignored? About is such a word which does not really carry that much content. We can essentially ignore that. We sometimes call such a word a stock word. Those are generally very frequent and they occur everywhere. Matching it doesn't really mean anything. But computationally how can we capture that? So again, I encourage you to think a little bit about this. Can you came up with any statistical approaches to somehow distinguish presidential from about? Now if you think about it for a moment, you'll realize that one difference is that a word like above occurs everywhere. So if you count the occurrence of the word in the whole collection, then we will see that about has much higher frequency than presidential, which tends to occur only in some documents. So this idea suggests that we could somehow use the global statistics of terms or some other information to trying to down-weight the element of about in a vector representation of d2. At the same time, we hope to somehow increase the weight of presidential in the vector of d3. If we can do that, then we can expect that d2 will get the overall score to be less than 3 while d3 will get the score above 3. Then we would be able to rank d3 on top of d2. So how can we do this systematically? Again, we can rely on some statistical count. And in this case, the particular idea is called inverse document frequency. Now we have seen document frequency as one signal used in the modern retrieval functions. We discussed this in a previous lecture. So here is the specific way of using it. Document frequency is the count of documents that contain a particular term. Here we say inverse document frequency because we actually want to reward a word that doesn't occur in many documents. And so the way to incorporate this into our vector representation is then to modify the frequency count by multiplying it by the IDF of the corresponding word, as shown here. If we can do that, then we can penalize common words, which generally have a lower IDF, and reward rare words, which will have a higher IDF. So more specifically, the IDF can be defined as the logarithm of M+1 divided by k, where M is the total number of documents in the collection, k is the DF or document frequency, the total number of documents containing the word W. Now if you plot this function by varying k, then you would see the curve would look like this. In general, you can see it would give a higher value for a low DF word, a rare word. You can also see the maximum value of this function is log of M+1. It would be interesting for you to think about what's the minimum value for this function. This could be an interesting exercise. Now the specific function may not be as important as the heuristic to simply penalize popular terms. But it turns out that this particular function form has also worked very well. Now whether there's a better form of function here is the open research question. But it's also clear that if we use a linear penalization, like what's shown here with this line, then it may not be as reasonable as the standard IDF. In particular, you can see the difference in the standard IDF, and we somehow have a turning point of here. After this point, we're going to say these terms are essentially not very useful. They can be essentially ignored. And this makes sense when the term occurs so frequently and let's say a term occurs in more than 50% of the documents, then the term is unlikely very important and it's basically a common term. It's not very important to match this word. So with the standard IDF you can see it's basically assumed that they all have low weights. There's no difference. But if you look at the linear penalization, at this point that there is still some difference. So intuitively we'd want to focus more on the discrimination of low DF words rather than these common words. Well, of course, which one works better still has to be validated by using the empirically correlated dataset. And we have to use users to judge which results are better. So now let's see how this can solve problem 2. So now let's look at the two documents again. Now without the IDF weighting before, we just have term frequency vectors. But with IDF weighting we now can adjust the TF weight by multiplying with the IDF value. For example, here we can see is adjustment and in particular for about there's adjustment by using the IDF value of about, which is smaller than the IDF value of presidential. So if you look at these, the IDF will distinguish these two words. As a result, adjustment here would be larger, would make this weight larger. So if we score with these new vectors, then what would happen is that, of course, they share the same weights for news and campaign, but the matching of about will discriminate them. So now as a result of IDF weighting, we will have d3 to be ranked above d2 because it matched a rare word, whereas d2 matched a common word. So this shows that the IDF weighting can solve problem 2. So how effective is this model in general when we used TF-IDF weighting? Well, let's look at all these documents that we have seen before. These are the new scores of the new documents. But how effective is this new weighting method and new scoring function point? So now let's see overall how effective is this new ranking function with TF-IDF weighting. Here we show all the five documents that we have seen before, and these are their scores. Now we can see the scores for the first four documents here seem to be quite reasonable. They are as we expected. However, we also see a new problem because now d5 here, which did not have a very high score with our simplest vector space model, now actually has a very high score. In fact, it has the highest score here. So this creates a new problem. This is actually a common phenomenon in designing retrieval functions. Basically, when you try to fix one problem, you tend to introduce other problems. And that's why it's very tricky how to design effective ranking function. And what's the best ranking function is their open research question. Researchers are still working on that. But in the next few lectures we're going to also talk about some additional ideas to further improve this model and try to fix this problem. So to summarize this lecture, we've talked about how to improve the vector space model, and we've got to improve the instantiation of the vector space model based on TD-IDF weighting. So the improvement is mostly on the placement of the vector where we give high weight to a term that occurred many times in a document but infrequently in the whole collection. And we have seen that this improved model indeed looks better than the simplest vector space model. But it also still has some problems. In the next lecture we're going to look at how to address these additional problems. [MUSIC]