Hello, my name is Michael Kapralov. In this lecture I will talk about finding heavy hitters in data streams. This lecture is about an algorithm in the so-called streaming model of computation. I will start by defining the streaming model of computation for you, then define the problem that we want to solve, that is, the heavy hitters problem. I will talk about some of its applications and then proceed to the actual technical content, that is, designing a very space-efficient algorithm for this problem.

The streaming model of computation is quite old by now. It was first defined in a foundational paper of Alon, Matias and Szegedy in the 1990s. In this model of computation we think of the algorithm as observing a very long stream of data items. The data items could be anything: IP packets, tweets, search queries, and so on. As the algorithm scans this very long stream of data items, its task is to maintain a small-space summary of the stream seen so far. From this small-space summary, we should be able to query some basic statistical properties of the stream. Informally, we think of the algorithm as having a single pass over a stream of data items. We will denote the data items by i1, i2, through iN. Capital N will be the length of the stream, and we will assume that the length of the stream is unknown.

We typically think of a streaming algorithm as having small, or sublinear, storage capacity. That is, the space available to the algorithm is substantially smaller than the stream that it is working with. We typically think of this as N to the power alpha for some small constant alpha less than 1, or, and this is the setting that we are in for this lecture, as polylogarithmic in the length of the stream. Now, when I talk about storage capacity, or the space available to the algorithm, what exactly do I measure it in? We could use bits, bytes, or words to measure storage capacity; in some settings, we could also use points or data items. So a streaming algorithm has to have sublinear space complexity. At the same time, we also want streaming algorithms to be fast, that is, to have small processing time per element. It turns out that for many problems it is necessary to use randomization in order to achieve sublinear space complexity, and indeed this is the case for the problem that we are considering today.

Today we will talk about the so-called heavy hitters problem. Formally, the problem is defined as follows. The algorithm is given a single pass over a sequence of data items i1, i2, i3, through iN, so N is the length of the stream. After the algorithm processes the entire stream, its task is to output the k most frequent items that occur in the stream. These k most frequent items in the stream are known as the k heavy hitters. Of course, we want our algorithm to have small space complexity; it cannot just store the entire stream and find the top heavy hitters. Specifically, in this lecture we will design an algorithm for the heavy hitters problem with space complexity on the order of k times log N, that is, k times the logarithm of the length of the stream. So, for example, if the number of heavy hitters that we are looking for is small, say k is equal to 1, then the space complexity of our algorithm is logarithmic in the length of the stream, which is exponentially better than what the trivial solution that just stores the entire stream would have to use. Storing the entire stream, of course, takes space linear in the length of the stream.
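To make the contrast concrete, here is a minimal sketch of that trivial exact baseline in Python. The function name and the use of Python's Counter are my own illustration, not part of the lecture; the point is simply that exact counting keeps a counter for every distinct item, which is nowhere near the k times log N space budget we are aiming for.

```python
from collections import Counter

def exact_heavy_hitters(stream, k):
    """Trivial baseline: count every item exactly, then report the k
    most frequent ones. Correct, but it keeps one counter per distinct
    item, so its space can grow linearly with the input rather than
    staying on the order of k * log N."""
    counts = Counter()
    for item in stream:          # a single pass over the stream
        counts[item] += 1
    return [item for item, _ in counts.most_common(k)]
```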
Let me give you an example for this problem, and here is the representation of the stream that I will use in the rest of the lecture. On the slide, at the bottom, you see the data items arriving one after another in a stream. We will associate data items with numbers between 1 and some large number N; for my examples on the slides, I will use numbers between 1 and 10. So the bottom line on the slide shows the data items arriving in the stream.

In the middle of the slide, you see the frequency histogram for these data items being updated on the fly. For example, at this point item 3 has arrived twice, so we see frequency 2 in the histogram, and items 4 and 6 have appeared only once.

So our algorithm observes this very long stream of data items, and at the end of the stream it needs to output the most frequent item that it has seen. In this particular example, the most frequent item is item 3: it occurred six times in the stream, whereas all of the other items occurred at most five times.

In the next video, I will talk about some of the applications of this heavy hitters problem. The applications will be to monitoring network traffic and to estimating search query statistics.
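Before moving on, here is a small Python sketch of the on-the-fly bookkeeping in the example above: update a frequency histogram as items arrive, then report the item with the largest count at the end. The specific stream below is a made-up illustration, chosen so that item 3 occurs six times and every other item occurs fewer times; it is not the exact stream from the slides.

```python
# Hypothetical stream over items 1..10: item 3 occurs six times and
# every other item occurs fewer times. The stream on the slides is
# different, but the bookkeeping is the same.
stream = [3, 4, 3, 6, 3, 1, 3, 7, 3, 9, 3, 4, 6, 1, 7]

histogram = {}
for item in stream:                      # update the histogram on the fly
    histogram[item] = histogram.get(item, 0) + 1

most_frequent = max(histogram, key=histogram.get)
print(most_frequent, histogram[most_frequent])   # prints: 3 6
```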