We will now talk about one of the alternative technologies for DNA sequencing called DNA chips or DNA arrays. When Sanger assembled the phi genome from 500-nucleotide long reads in 1977, scaling this to the human genome would be extremely expensive and would likely fail due to unresolved algorithmic challenges. Nevertheless, US government in 1984 started to plan the Human Genome Project, that another 16 years later, in 2000, resulted in the draft sequence of the human genome. But at the same time, three scientists, at three different countries, thought about an alternative technology for DNA sequencing and they invented so-called DNA chips. Here's the main difference between DNA chips, or DNA arrays technology, and Sanger sequencing. In Sanger technology, biologists generate some, but not all, long reads sample from the genome. And the read length is approximately 500 nucleotides, that was the read length that Sanger used. In the DNA chip technology, biologists generate all short k-mers from a genome. But k is much smaller in this case. Instead of 500, it may be, in the original DNA chips paper it was proposed that k equal to 10. Ideally, a DNA chip would generate a k-mer composition of a strand, a multiset of k-mers that are present in the strand like shown here. Please note that some k-mers in this multiset appear multiple times, for example, ATG appears three times. Our goal is to reconstruct the original string from this k-mer composition. And please note that also I ordered this 3-mer in the order they appear in the string. In reality, I don't know the order. Let's try, nevertheless, to assemble this 3-mer into the string, and we will represent every 3-mer as a vertex in the graph. And this is a graph that corresponds to the string. And the question I want to pose is, can you construct this genome path, so path that spells the genome, without knowing the genome? Only from its composition. Well, if we simply connect two k-mers, if suffix of the first k-mer Is equal to the prefix of the second k-mer, then we will get this path. For example we connect TAA, we will use AAT. Because TAA ends in AA and AAT starts in AA. However, if we introduce all edges based on this principle, then of course, we will construct this genome path. But in addition, we will also have to connect some other vertices with each other. In fact, many vertices, and many more. And as a result, we will get a graph like this. Where is the genome path? Well, it is still the same horizontal path, but remember the reality is that the order of these 3-mers in the genome is unknown, and therefore if we order them lexicographically, this is the graph that we get. Where is the correct path in this graph? Let's try to reconstruct this string. I will even give you hint, let's say this string start with TAA. Then we look at the vertex TAA, and we find a vertex that is connected to this TAA by any actions, in this case it will be AAT. Then we'll continue and continue and continue and continue, Like this. What are we trying to do? So we are trying to extend the path, but what is the problem that we are trying to solve while doing this? And please note that from every vertex I visit, there are often multiple choices of the next vertex. Well, if we continue further, then you will see that the problem we are trying to solve is finding a Hamiltonian path in the graph. Or a path that visits each vertex exactly once. [MUSIC]