In this video, we're going to go into deeper detail on regularization to gain a stronger intuitive understanding of how it works. The goal is to shed additional light on these methods so that regularization approaches don't seem like a black box. We're going to go through a few views that add intuition to this process. First, we'll go over the analytical view, which gives a logical account of how we're achieving our goal of reducing complexity. Then we'll go over the geometric view, which will illuminate the actual optimization problem and show why LASSO generally zeros out coefficients while Ridge does not. Finally, we'll go over the probabilistic view, which will show how we can reframe LASSO and Ridge as Bayesian problems in which the coefficients have particular prior distributions.

Let's start with the analytic view. The analytic view presents the obvious: as we incur L2 or L1 penalties, we force our coefficients to be smaller, thus restricting their plausible range. A smaller range for the coefficients implies a simpler model with lower variance than a model whose coefficients can take any value. When we eliminate features entirely, it's clear how quickly we reduce that variance; just think of the difference between the possible solutions available when y is a function of x versus when y is a function of both x and x squared. When we merely shrink coefficients rather than eliminate them, we can think about how much the y variable actually changes in response to a feature. If a coefficient is close to zero, that feature has almost no effect. If the coefficient is large, a small change in that feature will have a large impact on our outcome variable, so the model is more sensitive to changes in that feature and therefore has higher variance.

Now let's discuss the geometric view. First, we'll go over the actual optimization problems and then slightly reframe them so we can understand them from a geometric standpoint. For Ridge, we're trying to minimize the error shown in the curly brackets, which is just the normal OLS optimization problem, the sum of squared residuals, but subject to the constraint that our coefficients stay small. Whatever solution we come up with, the sum of the squared coefficients must be less than or equal to some value, which we call S. Once we minimize, that sum of squared Betas will end up equal to some particular value, and this will become clear when we get to the geometric graph. The same holds for LASSO, except this time we minimize our error subject to the sum of the absolute values of our coefficients being less than or equal to S. With the cost function broken down this way, the optimal solution has to be found at the intersection of the penalty boundary, which represents the penalty we pay for our coefficients, whether that's Beta squared or the absolute value of each Beta, and a contour of the traditional OLS cost function.
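Written out, the two constrained problems described above take the following standard form. This is a sketch based on the verbal description; the on-screen notation in the video may differ slightly, and S is the coefficient budget mentioned above.

```latex
% Ridge: the OLS objective subject to an L2 budget S on the coefficients
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}
  \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \right\}
  \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le S

% LASSO: the same objective subject to an L1 budget S
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}
  \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \right\}
  \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le S
```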
We see here that we have the different contours: the red lines are the level sets of the OLS cost function, and the diamond or the circle represents the penalty boundary for LASSO versus Ridge. The geometry of this is why LASSO will tend to zero out certain coefficients. Let's go into a bit of detail here.

When we look at the graph, we see the possible coefficients Beta_1 and Beta_2 that can solve our optimization problem. The red lines represent sets of solutions that all have the same error under the traditional OLS cost function. If we weren't constrained by LASSO or Ridge, the optimal solution would be the Beta in the middle. Each of the outer rings is a set of different Betas that all produce the same error under the classic OLS objective: every point on the inner ring might lead to a squared error of 10, every point on the second ring an error of 20, and so on. On the other hand, no matter which Betas we choose on a given ring, we also have the associated regularization term to deal with in our cost function. Since every point on a ring has the same traditional OLS error, we want to choose the point on that ring that also minimizes the regularization term. This is why the solution has to occur where a ring intersects the penalty boundary in the first place: no matter where we land on the ring, we automatically incur the penalty associated with that Beta_1 and Beta_2, and for LASSO that additional cost is the absolute value of Beta_1 plus the absolute value of Beta_2.

Looking at LASSO, suppose that penalty, the absolute value of Beta_1 plus the absolute value of Beta_2, is equal to one. We end up with a diamond, which represents all the values of Beta_1 and Beta_2 for which that sum of absolute values equals one. At the right-hand corner we have Beta_1 equal to 1 and Beta_2 equal to 0; at the top corner we have Beta_2 equal to 1 and Beta_1 equal to 0. In between, a straight line leads from (1, 0) to (0, 1), where we'd have something like 0.5 plus 0.5 or 0.25 plus 0.75 for Beta_1 and Beta_2. All of those sums equal one, and when you plot them out, you end up with this diamond. For the diamond to intersect the contour at the point that minimizes our total cost, the contact will typically happen at one of the corners: unless the contour happens to be exactly parallel to an edge of the diamond, the contour expands until it first touches a corner, and at each corner one of the coefficients is exactly zero. This carries over to higher dimensions as well.

When we look at Ridge regression, by contrast, the intersection can happen at any point, because the set of Betas where Beta_1 squared plus Beta_2 squared equals 1 is not a pointy diamond but a circle. Every point on that circle carries the same penalty, and with no corners, the contour can touch the circle anywhere, so there is nothing special about the points where a coefficient equals zero. That's why Ridge merely shrinks coefficient values, while LASSO can actually eliminate them.
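To see the practical consequence of this geometry, here is a minimal sketch using scikit-learn. The video shows no code, so the library choice, the synthetic data, and the alpha values are all illustrative assumptions on my part; the point is simply that LASSO tends to drive weak coefficients exactly to zero while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Illustrative data: y depends strongly on the first two features,
# weakly on the third, and not at all on the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Standardize features so the penalty treats every coefficient on the same scale.
X_std = StandardScaler().fit_transform(X)

# The alpha values are illustrative; a larger alpha corresponds to a tighter budget S.
ridge = Ridge(alpha=100.0).fit(X_std, y)
lasso = Lasso(alpha=0.5).fit(X_std, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # shrunk toward zero, but not exactly zero
print("LASSO coefficients:", np.round(lasso.coef_, 3))  # weak features driven exactly to zero
```

Printing the two coefficient vectors mirrors the geometric picture above: the Ridge solution sits somewhere on the circle with every coefficient nonzero, while the LASSO solution lands on a corner of the diamond where the weak coefficients are exactly zero.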