This lecture module is about SIFT, SURF, FAST, BRIEF, ORB, and BRISK. What are these? These are augmented reality feature detection and description methods, and we're going to compare them. These techniques are the most famous ones used in augmented reality. So, if you have used augmented reality on your smartphone, your glasses, your head-mounted display, your laptop, tablet PC, or computer, your windshield, or anywhere else, then most likely you have used these. So, you might as well learn about them, and that's what we'll do in this lecture.

First, we will compare them. Here are the six most popular ones, and we will see why. In addition, we will see why they have evolved from one another. Think about it: if the first one that was developed had been perfect in every way, there would have been no need to create more. Evidently, it was not perfect enough, and that's where the technology evolution kicked in. Starting from here you can see SIFT, SURF, FAST, BRIEF, ORB, and BRISK. These are the most popular ones. Look at the years of their creation: SIFT in 1999, SURF in 2006, FAST in 2006, BRIEF in 2010, ORB in 2011, and BRISK in 2011.

I want to start with SIFT, because SIFT had very good features and its characteristics were highly desired. However, its computation burden was very high, even for the time it was created in 1999. Now think about the modern-day cameras on your smartphone. The resolution is much, much higher and it grows every year as new smartphones come out. As cameras gained higher resolution, the burden of processing SIFT became even heavier. So the goal became: let's find a way to keep the good advantages of SIFT, but make it much less computationally burdening, so that we can save energy, speed it up, and use it on our augmented reality devices. So, we will focus on some of the good things and bad things of SIFT and see how the technology evolution trend is going.

Feature detector. Number one, SIFT uses Difference of Gaussians, and later on I'm going to be talking about DoG, D-o-G; that's exactly what DoG stands for, Difference of Gaussians. SURF uses the Fast Hessian. FAST uses binary comparison, and ORB uses FAST. ORB is a combination of FAST and BRIEF, where BRIEF is used in a rotated fashion; for its feature detector, ORB borrows the FAST technology. Then BRISK uses FAST or AGAST, where AGAST is an evolved, improved FAST technology, and we'll talk more about this.

Spectra-wise, look at this. You can see that SIFT uses the local gradient magnitude, SURF uses integral box filtering techniques, and BRIEF, ORB, and BRISK use local binary comparisons. The local binary process is an approximation method, but it needs a lot less computation and the results are close enough to what we want.

Then comes the orientation. Does the method use orientation in its processing? The answer is yes for all of them except FAST and BRIEF, which you'll see marked as not applicable or no. All the others include an orientation processing mechanism.

Then, feature shape. Looking over here, SIFT as well as BRIEF, ORB, and BRISK use squares, and SURF uses Haar rectangles.
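As a rough, hands-on illustration of the detector families listed above, here is a minimal Python sketch. It assumes OpenCV 4.4 or newer and an image file named scene.jpg; both the library version and the file name are assumptions for illustration, not part of the lecture material. SURF is left out because it is patented and only available in the separate opencv-contrib nonfree build.

```python
import cv2

# Load a test image in grayscale (the file name is a placeholder).
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Instantiate several of the detectors discussed in this lecture.
detectors = {
    "SIFT":  cv2.SIFT_create(),                 # DoG detector, gradient-based descriptor
    "ORB":   cv2.ORB_create(),                  # FAST detector + rotated BRIEF descriptor
    "BRISK": cv2.BRISK_create(),                # AGAST detector, binary descriptor
    "FAST":  cv2.FastFeatureDetector_create(),  # detector only, no descriptor
}

# Detect interest points with each method and compare how many are found.
for name, detector in detectors.items():
    keypoints = detector.detect(img, None)
    print(f"{name}: {len(keypoints)} interest points")
```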
Then we have the feature pattern, in which SIFT uses a square, SURF uses a dense technique, BRIEF uses random point-pair pixel comparisons, and ORB and BRISK use trained point-pair pixel comparisons.

The distance function is the next thing we will look into. For SIFT and SURF it's Euclidean, whereas BRIEF, ORB, and BRISK use the Hamming distance. So, there's the Euclidean distance and the Hamming distance. What is the difference here? The Euclidean distance is the actual straight-line distance between two XY points, which you compute from their coordinates using the Pythagorean theorem. The Hamming distance is used for binary operations: you take one binary code word and another binary code word, and the number of bit positions in which the ones and zeros differ becomes the Hamming distance. So, how many bit positions are the same and how many are different is what the Hamming distance is determined by. Therefore, for binary operations, the Hamming distance can be used, and it is different from the Euclidean distance.

Robustness. Look at the numbers on top, which are six, four, two, three, and four; they are different, showing higher robustness for SIFT and lower numbers for the other algorithms. You can see why it's six for SIFT: it is robust against brightness, rotation, contrast, affine transformation, scale, and noise. That's cool. When it comes to SURF, you'll see robustness to scale, rotation, illumination, and noise. Then for BRIEF, robustness exists for brightness and contrast. For ORB, there is brightness, contrast, rotation, and limited scale. Then for BRISK, you see four features, which are brightness, contrast, rotation, and scale. Overall, you can see that SIFT is robust across a wide range of conditions, and that makes it very attractive. However, SIFT's heavy computation is why the other techniques use approximation, so that you don't need all that processing to get the results. We get good enough results with much less computation, and that is where the approximation techniques contribute.

Then come the pros and cons, the good things and the bad things. The pros are like this: SIFT and SURF are very accurate. However, SIFT has heavy computation needs and is therefore slow, and SURF is also relatively slow. The other con is that they're both patented, so in order to use them you need permission and you also need to pay patent fees. On the other hand, FAST generates a large number of interest points, maybe too many, and that is a con. At the same time, FAST, BRIEF, ORB, and BRISK are all fast and can be used in real time. That is why they were created: real-time augmented reality is where the true augmented reality technologies are featured in real applications and devices, and that is what makes them so attractive. So, on the top, when you see the benefits of being fast and real-time applicable, these are really important for augmented reality systems. But going down here, you can see that ORB and BRISK are less scale invariant, and BRIEF is not scale or rotation invariant.
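To make the two distance functions mentioned above concrete, here is a small sketch with made-up descriptor values; the numbers are purely illustrative.

```python
import numpy as np

# Euclidean distance (used with SIFT/SURF descriptors): the straight-line
# distance between two points, i.e. the Pythagorean theorem extended to
# many dimensions.
a = np.array([1.0, 4.0, 2.0, 0.5])
b = np.array([2.0, 1.0, 2.0, 1.5])
euclidean = np.sqrt(np.sum((a - b) ** 2))      # same as np.linalg.norm(a - b)

# Hamming distance (used with BRIEF/ORB/BRISK binary descriptors): the number
# of bit positions in which two binary code words differ, computed here with
# XOR followed by a bit count.
x = 0b10110100
y = 0b10011100
hamming = bin(x ^ y).count("1")

print(f"Euclidean distance: {euclidean:.3f}")  # 3.317
print(f"Hamming distance:   {hamming}")        # 2
```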
These are some of the characteristics that you need to consider when selecting which technique to use for which purpose. I know this is not sufficient for understanding these six most important augmented reality feature detection and feature description techniques. That is why we are going to go into details on all six of them, in this order, in the following lecture. Hang in there; that's what's going to come right next.

We will start with SIFT, which stands for Scale Invariant Feature Transform. This is one of the first feature detection schemes that was proposed. It uses image transformations in the feature detection and matching process. SIFT's characteristics include that it is highly accurate, which is wonderful. However, its large computational complexity limits its use in real-time applications, and this is an issue.

Look at the process that you see right here. The table that you saw in the former lecture, as well as the figures that I'm showing you right here, are from the PhD dissertation at Yonsei University of Dr. Yong-Suk Park, who is my grad student, and we worked together on this research for several years. I am proudly showing you some of the results that were accomplished and are shown in his PhD dissertation. I'm very proud of my grad students who are doing research with me, and who have done research with me in the past, to accomplish new technologies. So, I'm using this proudly, and here we go.

In the overall SIFT flow, you can see the DoG generation on the top. DoG stands for Difference of Gaussians, and we will talk about this on the following page. It's based on Gaussian techniques, and we extract the differences between Gaussian-blurred images. After that, the process goes into the key point detection mechanism. As you can see here, the key points are based upon the different levels of an image produced at different scales, and on finding the maximum or minimum points, the extrema points. This is followed by the orientation assignment technique, which I will show you soon in further detail, and then by the descriptor generation technique. Then the image is down-sampled and we move on to the next octave; you will see the details of this on the following page.

Difference of Gaussians generation, DoG generation, builds a scale space using an approximation based on DoG techniques. So, if you look at SIFT and, following it, SURF, the scale space is the core technology that starts the augmented reality process. Local extrema of the DoG images at varying scales are selected as interest points. These local extrema: what is an extremum? Within an image, there are local ranges that you can divide it into based on the characteristics of the image. In each local range there are extreme points, extreme in the sense of being either a maximum or a minimum point in that local range. That's what we mean by local extrema, and these are the key points that we're looking for using the DoG images at various scales; these will become our interest points. DoG images are produced by taking differences of images convolved, or blurred, with Gaussians at each octave of the scale space in the Gaussian pyramid. So, we are going to use the scale space and the Gaussian pyramid, in which we go from the coarse, abstract scales down to the full-resolution scale. Stacking them up, it looks like a pyramid of images based upon Gaussian processing. The processing involves convolution, or blurring, techniques, and the Gaussian images are down-sampled after each octave.
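To show what DoG generation over a Gaussian pyramid can look like in code, here is a simplified Python sketch. The number of octaves, the number of blur levels, and the base sigma are illustrative choices, not the exact parameters of the original SIFT paper, and the input file name is a placeholder.

```python
import cv2
import numpy as np

def build_dog_pyramid(img, num_octaves=4, num_scales=5, sigma0=1.6):
    """Blur the image with progressively larger Gaussians at each octave,
    subtract adjacent blurred images to get the DoG images, then
    down-sample the image by half for the next octave."""
    k = 2 ** (1.0 / (num_scales - 1))     # scale step between blur levels
    current = img.astype(np.float32)
    dog_pyramid = []
    for _ in range(num_octaves):
        blurred = [cv2.GaussianBlur(current, (0, 0), sigma0 * (k ** s))
                   for s in range(num_scales)]
        dogs = [blurred[s + 1] - blurred[s] for s in range(num_scales - 1)]
        dog_pyramid.append(dogs)
        # Down-sampling step: half the resolution before the next octave.
        current = cv2.resize(current,
                             (current.shape[1] // 2, current.shape[0] // 2))
    return dog_pyramid

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder input image
pyramid = build_dog_pyramid(img)
print([len(octave) for octave in pyramid])            # DoG images per octave
```

Stacking the octaves from the full-resolution images at the bottom to the down-sampled ones at the top gives the pyramid shape described above.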
Why use DoG? What is it doing? Well, DoG is an approximation of LoG, the Laplacian of Gaussian. So, what we're really saying is that we would want to use the Laplacian of Gaussian, but instead we're using the simpler, less computationally burdening DoG technique. A Laplacian of Gaussian? Why would you want to use a Laplacian? What is the benefit of it? Hang in there, I'll explain it soon. DoG has low computational complexity compared to LoG. In addition, DoG does not need partial derivative computation like LoG does. So, from that statement alone you can see that LoG is very computationally burdening, and it is burdening because it uses partial derivatives. DoG obtains local extrema of images with difference-of-Gaussians techniques.

Classical interest point detection methods commonly use LoG technology. LoG is a very popular approach because, in interest point detection, LoG is scale invariant when it is applied at multiple image scales. The characteristics of a Gaussian scale-space pyramid and kernel techniques are frequently used, and this is just a conceptual example of a pyramid of images at different levels. This is not exactly what you will see as a result; however, it gives you an idea of what we mean by a pyramid of images at different scales.

Approximation of the Laplacian of Gaussian. What is the Laplacian doing? It is a differential operator of a function on Euclidean space. Once again, remember that SIFT and SURF are based upon Euclidean space and the Euclidean distance, whereas the other algorithms, which are faster and more appropriate for real time, are based on the Hamming distance because they use binary operations. Here, this scheme is a second-order Gaussian scale-space derivative based process, and it is sensitive to noise and requires a lot of computation. This is one of the reasons we want to avoid the Laplacian process, and that is why we want approximation methods. That is why DoG technology is used instead of LoG technology in SIFT.

The SIFT processing steps include key point detection, where a key point localization process is conducted. Each pixel in the DoG image is compared to its neighboring pixels; that is what is meant by a localization process. What are you trying to do? You're trying to do key point detection. The comparison is carried out on the current scale and on the scales directly above and below it. Why? To find key points. Candidate key points are pixels that are local maximums or minimums; the extrema points are the candidate key points. The final set of key points excludes low-contrast points, because a low-contrast point may contain error and may not be accurate. So, even though its value is a maximum or minimum in its local range, if the contrast is low you may not want to use it, because it may be faulty.
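Here is a rough sketch of the key point candidate test just described: a DoG pixel is kept as a candidate if it is larger or smaller than all of its neighbors at its own scale and at the scales directly above and below, and low-contrast candidates are discarded. The contrast threshold is an illustrative value, and the sketch assumes an interior pixel so that the 3x3 neighborhoods exist.

```python
import numpy as np

def is_extremum(dog_below, dog_same, dog_above, r, c, contrast_thresh=0.03):
    """Return True if pixel (r, c) of dog_same is a local extremum across the
    26 neighbors in the three adjacent DoG images and passes a simple
    low-contrast check. Assumes (r, c) is not on the image border."""
    value = dog_same[r, c]
    if abs(value) < contrast_thresh:      # drop low-contrast candidates
        return False
    # 3 x 3 x 3 neighborhood: the pixel itself plus its 26 neighbors.
    neighborhood = np.stack([dog_below[r-1:r+2, c-1:c+2],
                             dog_same[r-1:r+2, c-1:c+2],
                             dog_above[r-1:r+2, c-1:c+2]])
    return value == neighborhood.max() or value == neighborhood.min()
```

In the full algorithm this test would run over every interior pixel of every DoG level, followed by sub-pixel refinement and edge-response filtering that this sketch leaves out.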
Then we have the orientation assignment procedure, where key point orientation determination is what we're trying to do. The orientation of a key point comes from the local image gradient histogram in the neighborhood of the key point. What is the gradient doing? The gradient shows you the rate of change. A histogram is a graphical structure that shows you where the amount of change is largest and smallest. So, by looking at it, we can find the orientation of a key point based on the gradient histogram technique, in which the peaks in the gradient histogram are selected as the dominant orientations, because that's where the most change is detected. How is the change detected? Using the gradient technique.

SIFT descriptor generation is based on computing a feature descriptor for each key point, and the feature descriptor is built from orientation histograms that together give 128 values per key point. In the figure below, you can see the image gradients, the key point descriptor, and the 128 features for one key point, as the sequence of operations used in SIFT. These are the references that I used, and I recommend them to you. Thank you.
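As a small closing supplement, here is a short sketch that checks the 128-value descriptor and the per-key-point orientation discussed above, assuming OpenCV 4.4 or newer and a placeholder image file scene.jpg.

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each key point carries the dominant orientation (in degrees) found from the
# gradient-histogram step, and each descriptor row holds 128 values
# (4 x 4 subregions x 8 orientation bins).
print(descriptors.shape)       # (number_of_key_points, 128)
print(keypoints[0].angle)      # dominant orientation of the first key point
```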