We just saw how to create a de Bruijn graph, and we also learned what an Eulerian walk is. It's a walk through the graph that goes from node to node and that crosses each of the edges exactly once. And we said that this corresponds to a reconstruction of the original genome, and so it gave us a new algorithm for reconstructing the sequence of a genome from a collection of sequencing reads. So let's apply this idea in a couple of scenarios, and let's see how it works in a couple of more complicated examples. And we also want to see whether it actually solved the issues that we had with the shortest common superstring formulation. For example, the shortest common superstring had a tendency to overcollapse repeats, so we're going to see if we have the same problem here. Here's a de Bruijn graph, for a familiar example. This is the genome that the shortest common superstring had trouble with. So we collapsed the three copies of the word long into just two copies. Well, when we draw the de Bruijn graph for this string, for this genome, using a camera length of five, and assuming that the sequencer sequences each camera exactly once, then we end up with the picture that you see here. We said previously that de Bruijn graphs produced in this way are Eulerian, so they have an Eulerian walk, and so if we look at this graph we ought to be able to find a walk through it that reconstructs the original sequence. And we can. So here it is. We start here, and we spell out a long, long, long time. So there you go, that's the Eulerian walk through the graph. So again, the Eulerian walk did what we said it should. It gives us the reconstructed genome sequence back again. But the point is that it, actually in this case, did not overcollapse that repeat. It did not overcollapse the repeat long_long_long. It gave us the original sequence back, and that's an improvement over the shortest common superstring formulation of the problem. So that's great. But let's not be lulled into thinking that we've solved all our problems. So we actually can't escape from the third law of assembly which states that repetitive portions of the genome make assembly difficult. And let's see exactly how they make assembly difficult when we use an algorithm that's based on the de Bruijn graph, like we are here. So here's another example, and this time things won't be quite so straightforward. So the de Bruijn graph that you see on the right is the graph that corresponds to the genome on the left using a camera length of three, and as always we're assuming that each camera is given to us exactly once. And this graph on the right is Eulerian like we said before because there's a way to walk along the nodes of the graph such that we cross each edge exactly once. However, there is more than one way to do so, so there's more than one Eulerian path for this graph. So one of those ways looks like this. Let's start in ZA, and we're going to start by going down to AB. And then we can go up to BE, and then along here, and then down here and along here, and then down to BY. And that's one of the Eulerian paths. The other one goes like this. We start from ZA, we go to AB, and we start by going down to BC and going around like this. Then we go up to BE and go around like this. Then we go down to BY. So here are those two walks that we just took, the series of nodes that we visited along those walks. So we found two different ways of walking along our de Bruijn graph in order to try to reconstruct the original genome. So there's ambiguity. We haven't escaped the third law of assembly here. Instead, the way that it manifests itself is, it means that the de Bruijn graph for repetitive genomes will, in general, have many different Eulerian walks, and only one of those walks actually corresponds to the original genome sequence. All the rest of the walks correspond to incorrect reshufflings of the genome where portions of the genome that occur between repetitive sequences are going to get shuffled around and put in the wrong order. So in this case, again, repeats made assembly difficult, and specifically the repeat that gave us trouble was the fact that this, the tumor AB appeared in three different places in our genome sequence. Okay. So let's see a slightly larger example of essentially the same idea. So, here I'm taking, I'm showing some Python code that takes a genome. The genome's shown at the top in blue. It's in English. And then we build a de Bruijn graph using a camera length of four, and then we call a function of the de Bruijn graph class here which is called eulerianWalk which simply finds the Eulerian walk and just returns basically a list of all the nodes visited during that walk. So, we won't have time to talk about it here, but the implementation for eulerianWalk is actually very simple. So eulerianWalk is a very simple and efficient function. So, this next line concatenates together the labels of all the nodes visited over the eulerianWalk, which is a very simple thing to do. It respects the overlaps between them so it actually starts by appending the first node label, and then it appends one more character for each subsequent node visited. And then it simply prints out the superstring that it's put together. So in this case, you can see that the answer that we get is actually the real genome. So this is the genome that we started with. So we get the correct answer back. So that's great. So now let's repeat this entire experiment but using a camera length of three instead of a camera length of four. And by decreasing the camera length, we're increasing the chance of being affected by repeats since the smaller k means there's more likely to be multiple occurrences of any given k-mer in the genome which is going to increase the chance that we have multiple different Eulerian walks for the same graph. And indeed, as you can see all the way at the bottom here, our reconstruction of the genome is incorrect, so it mixes up the order of these words that start with T right here. All right? So this is the curse of the repetitive genome which in the case of the de Bruijn graph it results in a spurious reshuffling of portions of the genome like you can see here. So now back to one other issue that we've been putting off, our assumption about the sequencing data. So we've been assuming that the sequencer is kind enough to give us exactly one read per k-mer, and without accidentally leaving out any k-mers, or giving us any k-mers twice, or anything like that, or without sequencing errors, and things like that. But of course none of these assumptions are actually true in practice. The sequencer gives us sequencing reads, which are usually going to be much longer than the value of k that we pick, by the way. So typical values of k are around 30 to 50 or so, whereas sequencing reads are usually around 100 to 200 or tp 300 or so bases long. But of course, the sequencer does make mistakes. There are sequencing errors. The coverage is uneven. It's not necessarily going to give you the desired number of copies of any particular bit of the genome. So given this reality, we can certainly still use our procedure for building the de Bruijn graph. In other words, we can take each read and chop each of the reads up into k-mers, and then for each k-mer split it into it's two k minus 1-mers and add the corresponding nodes and edges to the graph. We can still do that, but what we end up with is a graph that's not necessarily Eulerian. In fact, it'll never be Eulerian in practice. So, for example, if we have this genome, and the sequencing reads reported by the sequencer are the three reads that you can see here, and we're going to use a k-mer length of eight. So when we chop those reads up into k-mers we get the k-mers that you see down here. So we can still build a de Bruijn graph out of these k-mers. Right? We can build a de bruijn graph out of any set of k-mers. Then we can do that, and this is the graph that we get. And if you look at this hard enough, you can figure out that this graph is not Eulerian. There's no Eulerian walk of this graph, but it's almost Eulerian. I mean, if you go like this, if you go a long, long, long, time, you'll notice that it was almost Eulerian. The problem here was right here, the number of edges going from this node to this node. So finding Eulerian walks will not necessarily solve the assembly problem for us. Now that doesn't mean what we just investigated was pointless. So for one thing, it was an instructive example of how repeats make assembly different no matter what algorithm we use, and we solved two different algorithms in two different ways that the repeats made the assembly problem difficult. But, another reason is because this de Bruijn that graph we studied, the de Bruijn graph representation is actually a very common and useful way to represent assemblies, to represent the assembly problem, and in fact, many modern software tools for assembly use de Bruijn graphs as the internal representation of the relationships between the sequencing reads. So while shortest common superstring and Eulerian walk are both flawed formulations of the assembly problem, the overlap graph and the de Bruijn graph are both going to continue to be very useful to us in practice.