We've now seen a slow algorithm for finding the shortest common superstring, and we've seen a faster, greedy algorithm for finding a superstring that isn't necessarily the shortest, but is probably pretty close. In this lecture we're going to look at a down side of both of these approaches, and this is going to lead us to the third law of assembly. Which is the most frustrating of the three laws. And really, the point of this lecture is that, the shortest common superstring, when the genome is repetitive, the shortest common superstring of the reads is not going to be the correct answer. So let's see what we mean by this. Here's a specific example. So here, we use the greedy algorithm, which is introduced in the previous lecture. And the input strings are all the 6-mers from a string that's written at the top. And we're using English for this example, not DNA, but that's just to make things a bit clearer to visualize. The point we're making is the same for DNA. So say we run our greedy algorithm and again we'll visualize rounds of this algorithm in this simplified way, where each line corresponds to the labels of the nodes that remain after each round of merging. So the strings merged in each round are also highlighted in red so you can see exactly what's happening from round to round. So now let's look all the way at the bottom. Take a look at the final superstring that we get at the end of this algorithm. And what do notice about it? Well, it's actually shorter than the original genome. The original genome had three copies of the word long. And this final superstring that we obtained with the greedy algorithm had only two. One of the copies disappeared. So why did that happen? Well, the algorithm did exactly what it was supposed to do, right? It did exactly what we asked it to do. It didn't make any mistakes as far as the shortest common superstring problem is concerned. The problem here is that finding the shortest superstring isn't really what we want in this case, because of the repetitive nature of the genome. When the genome is repetitive, finding the shortest superstring will tend to take the repetitive portions of the genome, the repeat elements, the transposable elements, for example, and collapse them down into fewer copies than should really be there. So another way to look at this is like this. If we took these same 6-mers that we were using with the greedy shortest common superstring algorithm, which are from a sequencer. If we took these 6-mers and we assume that the 6-mers are just reads that are sampled randomly from the genome so they could come from random locations. Then if we look at the reads, we can't tell exactly how many copies of the word long there are in the original genome sequence. The reads we have are actually consistent with various numbers of copies of the word long. So, we can tell that there are at least two copies. That we can tell, but we can't necessarily tell whether there are two or three or five copies, or some other number. The reads just aren't long enough to tell us exactly how many copies of the word long must be present in the original genome. So shortest common superstring is always going to go with the fewest copies required to explain the reads, and in this case, that's just two copies. So this is our first encounter with an issue that's probably the single most important issue that makes the assembly problem difficult in practice. Repetitive sequences can make it difficult or even impossible to correctly assemble the original genome. Repetitive sequences cause ambiguity, and just like if you're trying to put together a puzzle and the puzzle is like half featureless sky pieces, just like in that case, it's not always going to be easy or possible to resolve the ambiguity. So the third law of assembly states, repeats make assembly difficult. When the genome is repetitive, our attempts to assemble it will fail in some way, shape, or form. And the way in which we fail will actually depend on the particular algorithm that we're using to solve the assembly problem. We just saw this issue crop up with our shortest common superstring problem. When you look for the shortest common superstring, and it doesn't matter whether you're using the slow algorithm or the greedy algorithm, you're going to get this phenomenon where if there are many copies of a repeat, you'll accidentally collapse them down into fewer copies of the repeat than are really there. Other algorithms, when they fail, will make different kinds of mistakes. So for example, we might take two very distant portions of the genome and be tricked into sort of switching them around because of some repetitive sequence that they fell between that made it impossible for us to know exactly where each copy went. Another point to make about how important this issue is, is that you'll recall from a previous lecture, where we talked about transposable elements and how much of the human genome is covered by them. About half the genome or so is covered by repetitive DNA sequences. So, you can imagine that that makes the assembly problem very difficult for genomes, like the human genome, and other genomes that are very repetitive. So the way that we've stated the third law of assembly here is a bit imprecise and we'll refine it a bit later in a couple of future lectures. And in one those lectures, we'll discuss an alternative to shortest common superstring, which is called the de Bruijn graph and the Eulerian walk method. That's an alternative that actually avoids the over collapsing problem. So it avoids this particular problem. But it still can't avoid the overall problem that repeats make assembly difficult. And in another later lecture, we'll see how real assembly software deals with this issue. And then in another lecture, we'll look a bit closer at whether new sequencing technologies might be able to help us sidestep this problem.