In the last practical, we implemented the brute force method of calculating the shortest common superstring. The problem with this method is that we have to compute the superstring for every permutation of reads, and, as our list of reads grows, this will become exponentially large and it will take forever to run. A better way of doing this is using the Greedy algorithm, which is going to be approximate, it will get somethings wrong. But it will be much faster, and much more feasible to run in practice. So, I've already pasted in here the overlap function, which we've seen a lot of times. I've pasted the shortest common superstring function we wrote in the last practical, and now we're going to implement the Greedy, shortest common superstring function. So the way this is going to work is we're going to to look our set of reads, and we're going to first find the two reads with the maximum overlap. And then we're going to combine those two reads, and we're just going to repeat that until we've combined all the reads into a single thing. So a helper function I'm going to write is pick maximal overlap, and this will take a set of reads and a minimum overlap, K, and it will return the two reads with the maximum overlap along with that overlap. So, to do this I'm going to define read A and read B to be none, and then I'll keep checking my best overlap length, which initially will be zero. And I'll say for each pair of reads A, B, in intertools.permutations, two. So I'm going to take pairs of reads. So for every pair of reads, I'm going to calculate the overlap length. And then if this overlap length is larger than the best overlap length that we've seen so far, store these reads as our best ones that we found. So I'll say if overlap length is greater than best overlap length. And best overlap length equals overlap length. And when we're done all this, we'll just return the best read A, read B, and overlap length that we found. >> So one of the things that might happen is there might be multiple pairs that are tied for the longest overlap and the way we've written this function basically, just the first one we encounter, the first one encountered by the loop is the one that we're going to import. Right. So now let's write our Greedy shortest common superstring function. This one will also take a set of reads and minimum of overlap K. Like I mentioned, we're going to start off by just calculating the two reads, A and B, with the best overlap, the maximum overlap that we can find. So I'll use my pickMaximalOverlap function. And now, I'm going to put a while loop, as long as the overlap length is greater than zero. We're going to replace reads A and B, in the set of reads, with their overlap. So I'll do reads.remove(read_a). Remove read_b. And now we'll append to this list the combination of these reads. So that will be read A plus the suffix of read B, since we don't want to contaminate the overlap region. And so what we've done is we've found the two reads in our set that have the best overlap, and we replaced those two reads with the overlap read. And we're going to keep doing this until we have one read left, which will be our shortest common superstring. So now I just need to recalculate the new AB and overlap length with the greatest overlap. And when this is done, I just want to. So the remaining reads, when we're done with this, will be all of the reads that don't have any. So in this case we can just join them all together, it doesn't matter and this will be our shortest common superstring. >> Like we said in the lecture, as we merge and merge and merge we might get to the point where there are no more edges, no more overlaps left in the overlap graph, but there are still multiple nodes left, in which case we can just concatenate the node labels together to get our super screen. >> Yeah. So now let's try running this on simple set. So I'll just give it a few. So all these strings overlap by two, so we should be able to get a nice, short string from this. So you're going to have a minimum overlap of two, and like you would expect those strings all overlap by two so it works out very nicely. All right, so the problem with this is it's not always going to give us the correct answer, right? Mm-hm. >> So let's see if we can find an example where the Greedy method does not give us the shortest string. So I'm going to get us three strings here, minimum overlap of one. >> Okay. >> Okay, so this is the Greedy method. And let's run it with the brute force method and see what the best one for that is. So you can see on this that the shortest common superstring is actually shorter than the Greedy superstring, and the reason for this is to get the shortest common superstring, you can just combine these reads in order, and each one overlaps by two with the one after it. >> Mm-hm. >> But the first string and the third string actually overlap by three. So the Greedy algorithm will initially merge those two together into a larger string, and then it has no option but to just append the middle string to the end since it doesn't have any overlap. >> Mm-hm. >> So it actually does a bit worse by doing the Greedy method here. >> Mm-hm. Right. >> But that trade-off in terms of accuracy, in most cases, will be worth the time you save by doing the Greedy method right? >> Mm-hm.