This sequence of videos is about the design and analysis of heuristics, algorithms which are generally guaranteed to be fast but not guaranteed to be 100% correct. As a case study, we're going to revisit the knapsack problem that we thought about earlier in the dynamic programming section. Let's briefly review the three strategies for dealing with NP-complete problems. All of these are relevant for the knapsack problem. The first strategy is to identify computationally tractable special cases of the NP-complete problem that you're grappling with. If you're lucky, the instances relevant for your application will fall squarely into one of these computationally tractable special cases. Even otherwise, it's useful to have these subroutines lying around; perhaps with a little bit of extra work, you can reduce your problem instances to one of these computationally tractable special cases. A dynamic programming solution for the knapsack problem identifies one such special case, namely knapsack instances where the knapsack capacity is small, for example, comparable to the number of items. The second approach is the design and analysis of heuristics, algorithms which are not guaranteed to be 100% optimal. I'll have more to say about those in a second. The third strategy is to insist on a correct algorithm; since the problem is NP-complete, you're then expecting to see exponential running time, but still running time qualitatively better than what you'd achieve with naive brute-force search. Our dynamic programming solution to the knapsack problem can be viewed as a contribution in this direction. Naive brute-force search, enumerating over all possible subsets of items, takes time proportional to 2^n, where n is the number of items. The dynamic programming algorithm takes time proportional to n, the number of items, times W, the knapsack capacity. For most problems of interest, this dynamic programming solution, albeit exponential for reasons we've discussed in the past, will be a significant improvement over naive brute-force search. Let's now return to the subject of heuristics, the topic for today. Happily, despite being NP-complete, the knapsack problem admits some quite useful heuristics. We're going to study two of them. First, we'll use the greedy algorithm design paradigm to come up with a pretty good heuristic. Pleasingly, this pretty good greedy heuristic is also blazingly fast. We'll then pull out a different tool, namely dynamic programming, to develop yet another heuristic. It will be polynomial time, not as fast as the greedy solution, but its accuracy will be excellent: it will get within a 1 − ε factor of optimal, where ε is a parameter that we can tune as we wish. Now, there are zillions and zillions of algorithms in the world, and anybody can write one down. So if someone proposes an algorithm to you, the onus is on them to convince you that it's a good idea, to convince you that it's, in some sense, a good algorithm. Canonically, in these courses, I've been making that argument in two ways. First, I provide you with an argument that the algorithm I propose is correct, that it is always guaranteed to solve the problem we're interested in. Second, I argue that it solves the problem using a reasonable amount of resources. One great exhibit A would be, say, Dijkstra's algorithm. We wanted to solve the shortest-path problem with a single source and non-negative edge lengths.
I gave you a proof that Dijkstra's algorithm always successfully achieves that goal, and I showed you an implementation whose running time is not much more than what it takes to just read the input. Now, of course, this best-case scenario is not going to be realized for NP-complete problems, and in particular for the knapsack problem. You have to give up either on correctness or on the running time; you have to make a compromise. A tricky question, then, is how do you argue that the compromise you're making is not overly severe? How do you argue that your approach to an NP-complete problem is a good one? Heuristics relax correctness, and so one would like an argument that says correctness is not relaxed by too much. In the best-case scenario, when you propose a heuristic, you should supply a performance guarantee, which says that the solution is almost correct, in some sense, on every single instance, or failing that, at least on many instances of interest. Now, in practice, you do bump into problems that are so hard you wind up having to resort to heuristics where you really don't have a good understanding of how or when they work, where you really just cross your fingers, run the heuristic, and hope for the best. Local search is probably the preeminent example of this type of heuristic, something like the k-means algorithm for clustering. But for the problem we're going to talk about today, we're going to seek out heuristics that have worst-case performance guarantees. And of course, a broader theme of this discussion, and the reason I'm using knapsack as an example, is that it will show us how our programmer toolbox, our algorithm toolbox, is now rich enough to design not just lots of different exact algorithms but also lots of useful heuristics for NP-complete problems. So let me briefly remind you what the knapsack problem is. As input, we are given 2n + 1 numbers. There are n items, each with a positive value and a positive size. The final number is the knapsack capacity, W. The goal is to pack the knapsack with the most valuable set of items. That is, you're seeking a subset S of items with the property that the sum of the sizes of the items in S is bounded above by the knapsack capacity W. Subject to that constraint, your subset of chosen items should maximize the sum of the values. This is a fundamental problem that comes up all the time, particularly as a subroutine: really, whenever you have a constrained resource that you want to use in the best way possible, that's basically the knapsack problem. So let's turn to heuristics for the knapsack problem. Now that I've told you that we're happy with heuristics, that we're willing to relax correctness, this all of a sudden resuscitates the greedy algorithm design paradigm as a feasible approach to the knapsack problem. Because the knapsack problem is NP-complete, we certainly are not expecting to find an exactly correct greedy algorithm, but maybe there's a greedy algorithm which is pretty good, and we expect most greedy algorithms to run extremely quickly. So let's talk through a potential greedy approach to the knapsack problem. Probably the first idea you'd have would be to consider the items in some order, and when you consider an item, you make an irrevocable decision at that point whether or not to include it in your solution. So the question then is, in what order should you look at the items? Well, what's our objective?
Our objective is to maximize the value of our set. So obviously, high-value items are important, and maybe the first idea would be to sort the items from highest value to lowest value. But if you think about that proposal for, say, 30 seconds, you quickly realize it's a little naive; it's not the whole story. If you have a high-value item that fills up the whole knapsack, that's not quite as useful as an almost-as-valuable item whose size is close to zero and which barely uses up any of the knapsack at all. Remember that each item has two properties, not just its value but also its size, and both of these are important. We want to prefer items that have a lot of value, but we also want to prefer items that are small. So I hope many of you are now sensing a strong parallel between the present discussion and the discussion we had in our first serious study of a greedy algorithm. That was scheduling jobs on a single machine to minimize the sum of the weighted completion times. In that old scheduling problem, just like in this one, each object had two parameters. Back then, they were a job's length and its weight, or priority; you preferred higher-weight jobs to be first, and you also preferred shorter jobs to be first. Here again, we have items with two parameters, the value and the size. High value is good; small size is good. Back in the scheduling problem, the key idea was to look at the jobs' ratios and schedule in that order. By analogy, here in the knapsack problem, if we want to take the two parameters of each item and form a single parameter by which we can then sort the items, a natural first cut is a ratio. Since we prefer high values and we prefer small sizes, the sensible thing to look at is the ratio of the value of an item divided by its size. We'll then consider items from the highest of these ratios down to the lowest. One way to think of this is that we consider the items in order of non-increasing bang-per-buck: we're always trying to get the most value for our money, where money is the amount of knapsack capacity that we're spending. So now that we have our greedy ordering, we just proceed through the items one at a time, and we pack the items into the knapsack in this order. Now, something can happen here that didn't actually trouble us in the scheduling problem: at some point, we might no longer be able to pack items into the knapsack; the thing might get full. Once that happens, once we've reached an item which doesn't fit in the knapsack given the items we've already packed into it, we just stop the greedy algorithm. If you think about it, the way I've written this algorithm, aborting as soon as you fail to pack one item, is not really what you would do in practice. You'd do something a little bit smarter. If you fail to pack some given item, say item k + 1, well, item k + 2 has a worse ratio, but maybe it's much smaller. Maybe item k + 2 does fit in the knapsack, and then there's no reason not to just take it. So in practice, you'd go through the entire list of items and pack whatever fits, skipping whatever doesn't fit. For simplicity, I'm going to go ahead and analyze the slightly worse heuristic in the slides to follow. So just a real quick example: let's consider the following three-item instance, where I've taken the liberty of sorting the items by ratio.
The first item has value 2 and size 1, for a ratio of 2. The second item has value 4 and size 3, for a ratio of 4/3, and the third item has value and size both equal to 3, for a ratio of 1. Let's say the knapsack capacity is 5. So what does the greedy solution do? It first packs in item number 1; then there's a residual knapsack capacity of 4, so there's room for item number 2, and it packs that. Now there's a residual knapsack capacity of only 1, so there isn't any room for item number 3, so it halts. The greedy algorithm is just going to output the set {1, 2}. In this particular example, you'll notice that this greedy solution, of value 6, is in fact the optimal solution; you can't do better. But let's look at a second example with just two items. In this particular instance, what is the value of the output of the greedy algorithm, and what is the value of an optimal solution? The correct answer is A. Since the knapsack capacity is 1000, and the sum of the sizes of the two items is 1001, there is no room for both of them. The greedy algorithm, unfortunately, because the first, tiny item has the better ratio, will pack item number one first. That leaves no room for item number two, so the value of the greedy solution is just 2, whereas the optimal solution is of course to just take the second item. Yes, its ratio is worse, but on the other hand, it fills up the whole knapsack, so overall your value is 1000, which obviously blows away the greedy solution. This quiz shows that, at least for general instances, the greedy knapsack heuristic we've devised thus far can be arbitrarily bad; it can be terrible with respect to an optimal solution. There is, however, a simple fix to address this issue. We're going to add a very simple Step 3 to our greedy heuristic. The new Step 3 is just going to compare two different candidate solutions and return whichever one is better, whichever one has the bigger total value. The first candidate is just what we were doing before: the output of Step 2 of the algorithm. The second candidate is the single item that, by itself, has the maximum value. In other words, this new greedy algorithm just runs the old one, but then it does a final sanity check. It looks at each item individually, and it says, well, if this item, just by itself, dominates the solution I've computed thus far, I return this single item by itself instead. There's no question that adding this new Step 3 fixes the problem in the quiz on the last slide. If you add this Step 3, then we will choose the item that has value 1000 instead of the lousy solution of value 2 that we computed in Step 2. It may seem that this Step 3 is just kind of a hack, meant to address that specific instance. But remarkably, just adding this Step 3 transforms the greedy heuristic into one that has a reasonably good worst-case performance guarantee. Precisely, the value of the solution computed by this three-step greedy algorithm is always at least 50% of the value of an optimal solution. The jargon for this kind of algorithm is a one-half approximation algorithm: for every instance, it's guaranteed to compute a solution with value at least one-half that of an optimal solution. Also, as you would hope from a greedy heuristic, it runs extremely quickly; basically, all we do is sort the items, so the running time is O(n log n). In the next video, we'll provide a proof of this theorem.
And as we'll see, under some additional assumptions about the types of instances, the performance guarantee will in fact be much, much better than 50%.
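To make the three-step greedy heuristic concrete, here is a minimal Python sketch under the assumptions described above: sort the items by value-to-size ratio, pack them greedily until the first item that doesn't fit (the simpler variant analyzed in this video), and then compare against the most valuable single item. The function and variable names are illustrative, not from the lecture.

```python
def greedy_knapsack(values, sizes, capacity):
    """Greedy 1/2-approximation for knapsack (sketch).

    values, sizes: lists of positive numbers; capacity: the knapsack capacity W.
    Returns (total_value, chosen_indices), using 0-based item indices.
    """
    n = len(values)
    # Step 1: sort items by value-to-size ratio, largest ratio first.
    order = sorted(range(n), key=lambda i: values[i] / sizes[i], reverse=True)

    # Step 2: pack items in this order, stopping at the first item that
    # does not fit (the "slightly worse" variant from the lecture).
    chosen, used = [], 0
    for i in order:
        if used + sizes[i] > capacity:
            break
        chosen.append(i)
        used += sizes[i]
    greedy_value = sum(values[i] for i in chosen)

    # Step 3: compare against the single most valuable item (restricted here
    # to items that fit by themselves, as a small safeguard) and return
    # whichever candidate solution has the bigger total value.
    best_single = max((i for i in range(n) if sizes[i] <= capacity),
                      key=lambda i: values[i], default=None)
    if best_single is not None and values[best_single] > greedy_value:
        return values[best_single], [best_single]
    return greedy_value, chosen
```

On the two instances from this video (encoding the quiz instance with a tiny item of value 2 and size 1, consistent with the sizes summing to 1001), Step 3 is exactly what rescues the second case:

```python
# Three-item example: greedy packs the first two items for value 6 (optimal).
print(greedy_knapsack([2, 4, 3], [1, 3, 3], 5))      # (6, [0, 1])
# Two-item quiz instance: Step 2 alone gets value 2; Step 3 returns 1000.
print(greedy_knapsack([2, 1000], [1, 1000], 1000))   # (1000, [1])
```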