We've begun talking about treating vision as a computational problem. Our rather rough-hewn first approximation to the vision problem was expressed as this idea: we're going to imagine that we can write a program that will take arrays of pixels, which are just arrays of numerical values corresponding to intensity. In this case, we're not even dealing with color, just gray-scale images. What we want to do is write a program which will take a large array of pixel values. In our first version, we stipulated that the pixel values will be 8-bit numbers between zero and 255, where zero corresponds to black, 255 to bright white, and everything in between corresponds to some shade of gray. What we're going to do is take that array and interpret it as a three-dimensional scene. As we noted in our discussion last time, this is a mathematically impossible problem. That is to say, an infinite number of potential three-dimensional scenes could give rise to the very same array of pixels. Therefore, we have to make guesses about the nature of 3D objects in the world, about what things we might be likely to see, and we have to employ those guesses in writing our program to interpret these arrays. There's no alternative. It could be that this particular array of pixels has been generated by some wild jumble of points out in the world with different intensity values. That's possible, but it's extraordinarily unlikely. So one of the things we begin by doing is making assumptions about the nature of objects in the world, about what's likely to generate arrays of intensity values like this. A simple assumption that we all make, that animals make, when we open our eyes and try to solve the vision problem for ourselves, is that the world largely consists of solid objects with boundaries. So our first job in looking at an array of pixels like this is going to be to try and find where the salient objects might be.
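To make the representation concrete, here's a minimal sketch in Python of the kind of array we have in mind. The values are made up purely for illustration: a tiny patch of bright pixels next to dark ones.

```python
# A tiny 4x4 gray-scale "image": each pixel is an 8-bit intensity,
# 0 = black, 255 = bright white. (Illustrative values only.)
image = [
    [255, 255, 10, 10],
    [255, 255, 10, 10],
    [255, 255, 10, 10],
    [255, 255, 10, 10],
]

# The sharp jump from 255 to 10 between columns 1 and 2 is exactly
# the kind of rapid intensity transition an edge detector looks for.
print(image[0][1], image[0][2])  # 255 10
```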
What that translates to in programming terms, in looking at an array like this, is looking for edges: edges between white and black, or gray-scale and black. In effect, what we're looking for are very sharp transitions in intensity. So I mentioned last time, for instance, that little rectangle there, the one I'm trying to point to with my finger: that rectangle includes an edge. That is a place where the pixel values rapidly go from the white of the cube to the black of the background. If we're looking at this array of pixels, that rapid transition would be a cue for us that there might well be, we can't be sure, but there might well be a boundary of a three-dimensional object there. So our first step in looking at a scene of this kind is to try and isolate where the interesting edges might be. Again, we take into account that there are all kinds of ambiguities and potential errors in doing this, but we're going to make a succession of guesses at what the object might be, and naturally we have to begin someplace. So let's begin by seeing where we think the edges of the object might be. Now, it's very clear that we can have difficulty interpreting an image when we are trying to find where the edges are. Here is a famous, I'm not sure I could call it an optical illusion, a famous example of an object that is difficult to interpret if you haven't seen it before. It's interesting: if you have seen this photograph before and interpreted it, then you won't have any trouble re-interpreting it the next time you see it. That's already something quite interesting about our vision, about human vision. If you have not seen this photograph before, you may have interpreted it right off the bat, or you may not have. I will just tell you that this is a picture of a cow.
Now, again, if you have never seen this before, you may well be looking at this picture and thinking, what cow? That was my experience when I first came across this in a book where the caption of the photo was just "a cow." For five minutes or something, I was staring at this thing and going, what cow? But there is a cow there. This is an advantage of the video format: because I'm about to spoil the illusion for you, I would urge you to pause the video at this point and try to see the cow in this photo. It may take you a little bit of time if you've never seen it before, but look for the cow. Do that now, and when you restart the video, I will show you where the cow is in the photograph. Okay, I'm assuming that you've paused the video. Now, we're looking at the photo. Hopefully, you've seen the cow; if you haven't, I'm about to spoil it for you. I'll see what I can do with my finger here. See, here's one of the ears of the cow and there's the other ear. Hopefully you can see it; there's her nose, going around there. Then you can see the side of her head, and here is my finger right in the middle of her right eye. Now, hopefully, the cow is jumping out at you. Why was this picture so difficult to interpret? It's because the edges aren't very clear in it. So that very first step that we're incorporating into our program, the step of looking for boundaries of objects, is particularly difficult for this photograph. Now, interestingly, as I say, once you have interpreted this photo as a cow, you can't unsee it. You can't look at the photo and be confused by it anymore. That's a very interesting cognitive fact about vision. Even more interesting is the fact that if you see this photo perhaps a year or two from now, it won't be a problem for you. You will have a visual memory of the photo and you will remember interpreting it as a cow. So if I can call it an illusion, this illusion is now gone for you forever.
But there are things to think about in those facts as well. For now, we'll sweep those under the rug and just deal with the issue of finding edges in a gray-scale photo like the one before, or like this one. How would you go about finding edges? We're going to use a technique that may be familiar: if you've taken a course in Signals and Systems or Signal Processing, then you may well have seen a technique called convolution. We're going to be using a convolution process to find the edges in images. I'm not assuming that you've seen this term before; I'm going to do my best to explain it here in lecture, and there will be additional materials that I'll provide so you can exercise your knowledge of what convolution is. To begin with, the purpose of convolution is to take an image like the one at the left, again one I've gotten from other sources off the web, which happens to be a full-color image, and to produce from it a collection of lines that probably correspond to object boundaries, or at least points of interest. In other words, you could view what we're trying to do at the beginning as vastly reducing the complexity of the problem. We still have all the information present in that photograph at left, but what we're trying to do initially is just eliminate lots and lots of detail and see if we can find where the major objects are that we're going to have to interpret. That means finding the edges of those objects. So convolution is the mathematical procedure that we're going to be using to take the image at the left and produce the convolved version at the right. Now, a lot of detail and a lot of potential information is being lost in getting to that point. Again, we still have it on hand if we want to go back and look at the original photo, but what we want to do is take that image at the right and see if we can begin to find where the objects are.
This is the first step in image processing, and we'll view it as one of the first important steps in vision. In fact, edge detection is seen as one of the major tasks of what's called low-level vision. Low-level, meaning that it's essentially unconscious and automatic; you can't stop it. Just like in that picture of the cow before, once you have seen the edges in the cow, you can't unsee them. So edge detection is not a process completely under conscious control. It's largely automatic and it tends to be very fast. In cases like the cow, when it isn't very fast, we notice it. But in most cases when we open our eyes, we're not actually making any great effort to see edges in the world; that happens swiftly, relatively automatically, and unconsciously. So what's this magic process of convolution? Here's the idea. Think of that photograph on the left as being a scene laid out on the floor, a big two-dimensional image like a matte photo that's been laid out on the floor. What convolution does, here's the basic idea, is take that original image and process it in a way that looks for these rapid changes in intensity. Specifically, what you're doing for our task of finding edges is convolving that image with what's called a sombrero function. Now, I'll get to the meaning of convolve, but let me just show you what this sombrero function looks like. I think of it more like an orange juicer: think of it as a big hump with a trough all around it. The hump goes above zero; this is a numerical function. So the hump is a positive value, then there's a trough around that hump which has negative values, less extreme than the hump in the center, and then it flattens out from that point on. So this is called a sombrero function. I think of it as an orange juicer: positive in the center, a negative trough around it, and then essentially zero as you get further from the center.
Now, what we're going to do is take this function and imagine it being like an object that I can place. I'm going to place that sombrero function over each and every pixel of the image. So if you like, you can think of it as a million copies of this sombrero function, one placed over each and every pixel in the image. Now, what do I do with the sombrero function placed over a particular pixel? We're going to multiply the central pixel by the value at the center of the hump, and all the surrounding pixels by the values surrounding that hump. In other words, I'm doing my best with a sombrero function here. Imagine this is a sombrero. It isn't, it's my hat. I'm placing this hat centered at a particular pixel in the image and then I'm doing many, many multiplications: I'm multiplying the center pixel intensity by this value in the hat, then the surrounding pixel intensities by the nearby values in the hat, then the further surrounding pixel intensities by the other values in the hat. So a convolution is actually a multi-part step. You're taking one of these orange juicers, placing it above a pixel, doing many thousands of multiplications, and adding the results of all the multiplications together to get one final value. So again, let me go through this carefully. What you're doing is taking this sombrero function, placing it above a region of the image centered at a pixel, doing thousands of multiplications and one big addition. The result of that addition is going to be the result of this convolution process at that one point, and you're going to do this at each and every point of the image. Now, if you're a computer programmer, this may sound computationally very expensive. For each and every one of these million pixels, we're doing a convolution which itself involves perhaps many thousands of multiplications and one addition at the end.
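The multiply-and-add step above can be sketched in a few lines of Python. This is a minimal illustration, not a production edge detector: the 3x3 kernel is a small center-surround pattern with a positive hump in the middle and a negative trough around it, chosen so the values sum to zero.

```python
# A small 3x3 center-surround ("sombrero"-style) kernel: positive
# hump at the center, negative trough around it, summing to zero.
kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

def convolve_at(image, row, col, kernel):
    """One convolution step: multiply each pixel in the 3x3
    neighborhood of (row, col) by the matching kernel value,
    then add all the products together into one final value."""
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += image[row + dr][col + dc] * kernel[dr + 1][dc + 1]
    return total

# On a perfectly uniform patch, positives and negatives cancel:
flat = [[100] * 3 for _ in range(3)]
print(convolve_at(flat, 1, 1, kernel))  # 0

# Near a sharp white-to-dark transition, the result is nonzero:
edge = [
    [255, 255, 10],
    [255, 255, 10],
    [255, 255, 10],
]
print(convolve_at(edge, 1, 1, kernel))  # 735
```

Note that each call to `convolve_at` depends only on the original image, never on the result at a neighboring pixel, which is exactly why all of these steps can run in parallel.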
That's an awful lot of computational work. In the extreme and most thorough version of convolution, that's true. But a couple of things should be noted. First, and perhaps most importantly, each of these convolutions can be done in parallel. The result of a convolution at this point does not depend on the result of a convolution at a nearby point. So each and every one of these convolutions can be done in parallel. If you have in fact a million of these little sombrero functions, then you can do a million convolution steps all together, like a chorus. You don't have to do them all in sequence; you can do them all at one time. Moreover, realistically, you don't have to do many thousands of multiplications. You can get a pretty good approximation to this convolution step by doing a smaller number of multiplications, and we'll see over time how that works. Once you have done this convolution step, what you are then looking for are places where the results of the convolutions go from negative to positive or vice versa. The mathematical phraseology here is that you're looking for zero crossings of the convolution output. Once you've found the zero crossings, those are going to tell you where the edges are. So what you're looking at in that right image is the zero crossings of a convolution step. Now, this is a well-known technique from signal processing, and it's computationally fairly efficient. There are other advantages to it as well; I could mention a couple. You're looking for zero crossings, and the slope at the zero crossing tells you something about the rapidity of the change in intensity; that's one cue that you have. Moreover, the sombrero function is set up so that if you convolve an image that is uniform, that is to say, if you place your sombrero function over something that is pure white or pure black or pure gray, the result will be zero.
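The zero-crossing search can also be sketched simply. Here's a minimal one-dimensional version, with made-up convolution values: we walk along a row of convolution output and mark every place where adjacent values change sign.

```python
# A sketch of finding zero crossings in one row of convolution
# output: an edge is marked wherever adjacent values change sign.
def zero_crossings(row):
    return [i for i in range(len(row) - 1)
            if (row[i] < 0 < row[i + 1]) or (row[i] > 0 > row[i + 1])]

# Convolution output near an edge typically swings from one sign
# to the other right at the boundary (values are illustrative):
conv_row = [0, 2, 35, -35, -2, 0]
print(zero_crossings(conv_row))  # [2]

# The size of the swing across the crossing hints at how rapid the
# intensity change is: a bigger jump means a sharper edge.
print(abs(conv_row[2] - conv_row[3]))  # 70
```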
The reason that works out is that you can think of the hump as having a strong positive value, the trough as having mild negative values all around the central hump, and the outer regions as making a very mild contribution to the result because they're close to zero. So what you're really doing is multiplying the center values by a large positive number, the surrounding values by mild negative numbers, and the outer values by values so small, so close to zero, that they're really not going to matter. Now, there's not just one sombrero function; any function that obeys that basic structure will do. You can have an orange juicer that has a low but wide hump with a trough around it, or a very sharp peak with a trough around it. So you can think of sombrero functions as being tuned. A sombrero function with a very high peak doesn't even look like a sombrero anymore; it looks almost like a cone with a trough around it. A very wide sombrero function also doesn't look like a sombrero; it just looks like a mild hill with a gentle valley around it. Those are extremes of the sombrero function, and any one of them could be used in this convolution step. In practical terms, the mild sombrero function corresponds to a very rough view of the scene where you're just picking up the very large and important edges. A sharp sombrero function picks up more detail: edges that might be smaller or finer, or that correspond to milder changes in intensity. So by using sombrero functions that are tuned to different kinds of edges, you can take the very same scene and, in a sense, look at it in a blurred perspective, like the image at the bottom left here, or in a sharper perspective, like the image at the bottom right. That is, in both cases, we're convolving the original scene with a sombrero function.
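The tuning idea can be made concrete with a one-dimensional sombrero profile. This sketch uses an unnormalized "Mexican hat" shape, which is one common way to realize the sombrero function; the parameter `sigma` is the width knob described above.

```python
import math

# A 1D "sombrero" (Mexican hat) profile with tunable width sigma:
# positive hump at the center, negative trough around it,
# flattening toward zero. (Unnormalized, for illustration only.)
def sombrero(x, sigma):
    r2 = (x / sigma) ** 2
    return (1 - r2) * math.exp(-r2 / 2)

# A sharp sombrero (small sigma) drops into its negative trough
# close to the center; a broad one (large sigma) does so much
# further out, so it only responds to coarse, large-scale changes.
print(sombrero(0, 1.0) > 0)    # True: positive hump at the center
print(sombrero(1.5, 1.0) < 0)  # True: already in the trough for sigma=1
print(sombrero(1.5, 4.0) > 0)  # True: still on the hump for sigma=4
```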
But at the bottom left, it's a sombrero function that is broad and blurry, like squinting your eyes almost, and at the bottom right, it's a sombrero function that is sharp and picking up more detail. So there are a lot of advantages to using this convolution technique. You can use it at different tunings to start by getting a very fuzzy version of the scene and then gradually incorporate more detail as you go along. This is a tried-and-true technique in machine vision: convolution with a sombrero function. What about human vision? Do we do anything like this? It seems like an awful lot of mathematics to be employing just to find edges in a scene. The answer is, to some approximation, we do. In fact, in our own retina, behind the photoreceptor cells, there is a layer of what are called retinal ganglion cells, which in effect perform something like this convolution. Again, this is a diagram that I got from an article, but what you're seeing at the top is a hexagonal group of photoreceptors in the retina. The way you're supposed to interpret it is this: think of that central red photoreceptor as looking for high intensity; it will signal a strong positive response if it gets high intensity right at that photoreceptor. The blue photoreceptors around it contribute a mild negative response to whatever intensity they see. So think of that entire hexagon with the red photoreceptor in the center. If that area is uniform, if it's all the same intensity, then this is acting like a sombrero function: the red at the center will signal a particular value, but that'll be subtracted out by the values of the surrounding blue photoreceptors.
In other words, the red photoreceptor is adding in a positive value for the intensity it sees, and the blue photoreceptors are each subtracting out a mild value for the intensity they see; they want to see low intensity. Okay. So if this little hexagon is uniform in intensity, the result at the ganglion cell in the second layer will be zero. We'll get a positive value coming in from red, mild negative values coming in from the blues, and the result will be zero when it gets to the ganglion cell. Suppose, on the other hand, that this hexagon is positioned right over an edge. Again, I can't point with this thing, but imagine a line coming from the back of that array to the front and going right through the center red photoreceptor. So imagine that there's white, high intensity, at the red photoreceptor and at the three blues to the left, and low intensity at the three blues to the right. That's an approximation of what we're talking about: an edge that goes from white at left to black at right, where the edge is pretty much passing through, or maybe positioned just a little bit right of center of, that red photoreceptor. Now, think about what happens. The red photoreceptor signals: yes, high intensity. The three left blue photoreceptors are subtracting out mild values. The three blue photoreceptors at the right, however, are seeing low intensity, so they're essentially going to subtract out zero. The result is that you have a positive value coming to that ganglion cell: the contribution of the red cell, minus just a few mild values from the blues at the left, and minus nothing from the blues at the right. Okay? Let's try a slightly different scenario.
Suppose that the white-to-black edge is moved a little bit to the left, so that the white is just passing over those three blue cells at the left, and there are dark values at the red photoreceptor in the center and at the three blue cells at the right. In that case, the red cell is not responding; there's low intensity right over the red cell, so it's providing essentially zero. The three blue cells at the right are also providing zero, but the three blue cells at the left, all three of them, are throwing in a mild negative value. The result will be that the ganglion cell gets a negative value. It gets a negative value because the three blue cells at the left are seeing high intensity, which they don't like, and subtracting that out; the red cell is seeing low intensity and not supplying anything; and the blue cells at the right are seeing low intensity and not supplying anything. So the ganglion cell in that case is recording a negative value. The purpose of this very intricate discussion is just to show that a slight shift in the position of a white-to-black edge over this set of photoreceptors in the retina can result in a shift from negative to positive at the ganglion cell in the layer right beneath. That's a very good approximation to what the convolution step is doing: taking a central value, which is weighted positively, and subtracting out mild values corresponding to the intensity surrounding that central value. So this arrangement of photoreceptors in the human retina, with a layer of ganglion cells behind them, is similar in function to the convolution step that we've been talking about. This is an interesting case where thinking of vision as a computational problem can allow us to look for functions in the biological eye, and infer the purpose of those functions.
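The three scenarios just walked through can be captured in a tiny numerical sketch. The weights here are invented for illustration: one center photoreceptor weighted positively and six surround photoreceptors each weighted mildly negatively, chosen so that a uniform patch sums to zero.

```python
# A sketch of the center-surround arrangement: one "red" center
# photoreceptor weighted positively, six "blue" surround
# photoreceptors each weighted mildly negatively. Weights are
# illustrative, chosen so a uniform patch cancels to zero.
CENTER_WEIGHT = 6.0
SURROUND_WEIGHT = -1.0

def ganglion_response(center, surround):
    """center: intensity at the central photoreceptor.
    surround: intensities at the six surrounding photoreceptors."""
    return CENTER_WEIGHT * center + SURROUND_WEIGHT * sum(surround)

# Uniform illumination: contributions cancel, response is zero.
print(ganglion_response(100, [100] * 6))                 # 0.0

# Edge just right of center: center and left three are bright,
# right three are dark -> positive response at the ganglion cell.
print(ganglion_response(100, [100, 100, 100, 0, 0, 0]))  # 300.0

# Edge moved left: only the left three surrounds are bright,
# center and right are dark -> negative response.
print(ganglion_response(0, [100, 100, 100, 0, 0, 0]))    # -300.0
```

The sign flip between the last two cases is the zero crossing: a small shift in where the edge falls over the receptor patch swings the ganglion cell's output from positive to negative.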
So it makes sense that very early in the vision process, in this case just about as early as possible, you would want some machinery built into the eye that responds to the positioning of edges. We'll continue on with other correspondences between the biological and computational vision problems as we go along.