This lecture is about the evaluation of text retrieval systems. In the previous lectures, we have talked about a number of text retrieval methods, different kinds of ranking functions. But how do we know which one works the best? In order to answer this question, we have to compare them, and that means we have to evaluate these retrieval methods. So this is the main topic of this lecture.

First, let's think about why we have to do evaluation. I already gave one reason: we have to use evaluation to figure out which retrieval method works better. This is very important for advancing our knowledge; otherwise, we wouldn't know whether a new idea works better than an old idea. In the beginning of this course, we talked about the problem of text retrieval and compared it with database retrieval. There we mentioned that text retrieval is an empirically defined problem, so evaluation must rely on users: which system works better has to be judged by users. This becomes a very challenging problem, because how can we get users involved in the evaluation? How can we do a fair comparison of different methods?

So let's go back to the reasons for evaluation. I listed two reasons here. The second reason is basically what I just said, but there is also another reason, which is to assess the actual utility of a text retrieval system. Imagine you're building your own search engine application; it would be interesting to know how well your search engine works for your users. In this case, the measures must reflect the utility to the actual users in a real application, and typically this has to be done with user studies on the real search engine. In the second case, for the second reason, the measures only need to correlate with the utility to the actual users; thus, they don't have to accurately reflect the exact utility to users. The measure only needs to be good enough to tell which method works better. This is usually done through a test collection, and this is the main idea that we'll be talking about in this course. It has been very important for comparing different algorithms and for improving search engine systems in general.

So let's talk about what to measure. There are many aspects of search that we can measure and evaluate, and here I listed the three major aspects. One is effectiveness, or accuracy: how accurate are the search results? In this case, we're measuring a system's capability of ranking relevant documents on top of non-relevant ones. The second is efficiency: how quickly can you get the results, and how much computing resources are needed to answer a query? In this case, we need to measure the space and time overhead of the system. The third aspect is usability: basically, how useful is the system for real user tasks? Here, obviously, the interface and many other factors are also important, and we would typically have to do user studies.

In this course, we're going to talk mostly about effectiveness and accuracy measures, because the efficiency and usability dimensions are not really unique to search engines; they are needed for evaluating virtually any other software system, and there is also good coverage of such topics in other courses. But how to evaluate a search engine's quality, or accuracy, is something unique to text retrieval, and we're going to talk a lot about this. The main idea people have proposed for using a test set to evaluate text retrieval algorithms is called the Cranfield evaluation methodology.
This methodology was actually developed a long time ago, in the 1960s. It's a methodology for laboratory testing of system components, and it has been very useful not just for search engine evaluation, but also for evaluating virtually all kinds of empirical tasks; for example, in natural language processing or in other fields where the problem is empirically defined, we typically need to use such a methodology. And today, with the big data challenge and the use of machine learning everywhere, this methodology has been very popular, but it was first developed for a search engine application in the 1960s.

The basic idea of this approach is to build a reusable test collection and define measures. Once such a test collection is built, it can be used again and again to test different algorithms, and we're going to define measures that allow us to quantify the performance of a system or algorithm. So how exactly does this work? Well, we need to have a sample collection of documents, and this is just to simulate the real document collection in the search application. We're also going to have a sample set of queries, or topics; this is, again, to simulate the users' queries. Then, we'll have to have the relevance judgments. These are judgments of which documents should be returned for which queries. Ideally, they have to be made by the users who formulated the queries, because those are the people who know exactly which documents would be useful. And finally, we have to have measures to quantify how well a system's result matches the ideal ranked list that would be constructed based on the users' relevance judgments.

This methodology is very useful for studying retrieval algorithms, because the test collection can be reused many times, and it provides a fair comparison of all the methods: we have the same criteria and the same dataset to compare different algorithms. This allows us to compare a new algorithm with an old algorithm that was developed many years ago, by using the same standard.

So here is an illustration of how this works. As I said, we need the queries, which are shown here: we have Q1, Q2, etc. We also need the documents, and that's called the document collection. And on the right side, you will see that we need relevance judgments. These are basically binary judgments of documents with respect to a query. So, for example, D1 is judged as being relevant to Q1, D2 is judged as being relevant as well, and D3 is judged as not relevant to Q1, and so on. These will be created by users. Once we have these, we basically have a test collection.

Then, if you have two systems that you want to compare, you can just run each system on these queries and documents, and each system would return results. Let's say the query is Q1; then we would have the results here. I show R sub A as the results from system A. Remember, we talked about the task of computing an approximation of the relevant document set; R sub A is system A's approximation here, and R sub B is system B's approximation of the relevant documents. Now, let's take a look at these results. Which is better? Imagine you are a user: which one would you like? If we look at both results, there are some differences, and there are some documents that are returned by both systems. But if you look at the results, you will feel that maybe A is better in the sense that it doesn't have many non-relevant documents.
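To make the idea of a reusable test collection concrete, here is a minimal sketch in Python. The query and document IDs and the judgments themselves are hypothetical placeholders (only D1, D2, and D3 relative to Q1 come from the illustration above); the point is simply that a fixed set of queries, documents, and binary judgments can be stored once and then checked against any system's output.

```python
# A minimal sketch of a Cranfield-style test collection (hypothetical data).
# For each query we record the set of documents judged relevant by the users
# who formulated that query; all judgments here are binary.
relevance_judgments = {
    "Q1": {"D1", "D2", "D4"},   # D1, D2 relevant to Q1 (as in the illustration); D4 is an assumed extra
    "Q2": {"D3", "D5"},         # assumed judgments for a second query
}

# Each system being compared returns a ranked list of documents per query.
system_a_results = {"Q1": ["D1", "D2", "D7"]}                           # R_A: system A's approximation
system_b_results = {"Q1": ["D1", "D4", "D6", "D2", "D8", "D9", "D3"]}   # R_B: system B's approximation

def judged_relevant(query, doc):
    """Look up the binary relevance judgment for a (query, document) pair."""
    return doc in relevance_judgments.get(query, set())

# Because the documents, queries, and judgments are fixed, the same collection
# can be reused to score any number of retrieval algorithms.
print([judged_relevant("Q1", d) for d in system_a_results["Q1"]])  # [True, True, False]
print([judged_relevant("Q1", d) for d in system_b_results["Q1"]])
```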
And among the three documents returned, two of them are relevant. So that's good; it's precise. On the other hand, one could also say that maybe B is better, because we've got all of the relevant documents: we've got three instead of two. So which one is better, and how do we quantify this? Well, obviously this question highly depends on the user's task, and it depends on the users as well. You might even imagine that for some users, system A is better, if the user is not interested in getting all the relevant documents; in this case, the user doesn't have to read many non-relevant documents and will still see most of the relevant ones. On the other hand, one can also imagine that a user might need to have as many relevant documents as possible. For example, if you're doing a literature survey, you might be in this second category, and you might find that system B is better. So in this case, we will also have to define measures that can quantify these differences, and we might need to define multiple measures, because users have different perspectives when looking at the results.
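To attach numbers to these two perspectives, below is a small Python sketch of two commonly used measures that capture them: precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). The document IDs and counts reuse the hypothetical data from the earlier sketch and are assumptions chosen to mirror the informal comparison above; the lecture does not give exact figures.

```python
# Precision and recall for one query, given binary relevance judgments.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against the judged relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments and results: 3 relevant documents in total for the query;
# system A returns 3 documents (2 relevant), system B returns 7 (all 3 relevant).
relevant_docs = {"D1", "D2", "D4"}
result_a = ["D1", "D2", "D7"]                          # precise, but misses one relevant document
result_b = ["D1", "D4", "D6", "D2", "D8", "D9", "D3"]  # finds everything, with more noise

print(precision_recall(result_a, relevant_docs))  # about (0.67, 0.67): higher precision per document read
print(precision_recall(result_b, relevant_docs))  # about (0.43, 1.0): complete recall, lower precision
```

A user who only wants a few good documents would favor the first profile, while a user doing a literature survey would favor the second, which is exactly why more than one measure is needed.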