In this video we'll be concerned with the justification for using the least squares procedure, and we'll state two different justifications. One will be the Gauss-Markov theorem. This is a theorem that tells us that under certain conditions, the least squares estimator is best in some sense, and we'll explore what that means in just a minute. The second justification that we'll look at is the fact that, again under certain conditions, the least squares estimator is equal to the maximum likelihood estimator for the linear model. And that's nice because maximum likelihood, as I hope you know from a previous course, has some really nice properties: it's asymptotically unbiased, asymptotically efficient, and it's consistent. So it's got these nice properties that make it a good estimator.

So first, let's look at the Gauss-Markov theorem. The Gauss-Markov theorem holds under the four assumptions that we stated at the end of our last video. Namely, that the error terms have mean zero; that the expectation of each response is just the linear model itself, the linear term; that the errors are uncorrelated and have constant variance; and that the X transpose X inverse matrix exists. Under those conditions, the Gauss-Markov theorem says that the least squares estimator is the best linear unbiased estimator of beta. What this means is that among all linear unbiased estimators of beta, it has the lowest variance.

Think about why that might be nice. Often when we think about estimators, we're trying to balance two different things. One is the bias: on average, does the estimator land close to, or exactly on, the true set of parameters? The least squares estimator is in fact unbiased, which means that, on average, it equals the true set of parameters; if you take the expectation of the least squares estimator, you get the true set of beta values. The second thing we try to balance bias with is variance. It's somewhat useless to have an estimator that's unbiased but has a really large variance, which means it can vary a lot from sample to sample, dataset to dataset. What we would like is something that is unbiased, or close to unbiased, but also has a low variance, so that from sample to sample the value of the estimator doesn't change all that much. What the Gauss-Markov theorem is saying is that if you restrict yourself to linear unbiased estimators, so only that class, then the least squares estimator has the lowest variance.

Now, this is a great result, but it's not the final answer to the estimation problem. That's because, theoretically at least, it's possible to have an estimator that is just slightly biased, so it doesn't fall in the class of unbiased estimators, but has a much lower variance. There are certain procedures, like regularization procedures in statistics, that look at other estimators for certain contexts of linear regression. The regularization procedures introduce a little bit of bias into the estimation in order to reduce the variance. So if you had one of these assumptions violated, or close to violated, then you might want to check out one of those procedures.
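To make that statement concrete, here is one standard way to write the Gauss-Markov setup in matrix notation; this is a condensed recap using the y = X beta + epsilon notation from the earlier videos, not a full proof:

    y = X\beta + \varepsilon, \qquad E[\varepsilon] = 0, \qquad \operatorname{Var}(\varepsilon) = \sigma^2 I_n, \qquad (X^T X)^{-1} \text{ exists}

    \hat{\beta} = (X^T X)^{-1} X^T y, \qquad E[\hat{\beta}] = \beta

    \operatorname{Var}(\tilde{\beta}) - \operatorname{Var}(\hat{\beta}) \text{ is positive semidefinite for any other linear unbiased estimator } \tilde{\beta} = A y.

The last line is the "lowest variance" claim: every other linear unbiased estimator has a variance that is at least as large, in the matrix sense.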
But for us, if we don't have these assumptions violated, it's important to know that our least squares estimator has the lowest variance among the class of all linear unbiased estimators.

So the second justification for using least squares is that it actually matches the maximum likelihood estimator. This is important for statisticians who think that the maximum likelihood estimator is really the best procedure to use for estimation problems. And there are good reasons to think that the maximum likelihood estimator is at least a good choice, because it has certain nice statistical properties: in the limit it's unbiased, in the limit it has the lowest variance, it's consistent, and so on.

To show that the least squares solution is the same as the maximum likelihood estimator, we'll start out by deriving the maximum likelihood estimator. We'll get to a point where we realize that what we maximize in maximum likelihood estimation is basically just off by a negative sign from what we minimize in least squares. And importantly, to show their equivalence, we need to make one further assumption. This further assumption is that the error terms, the epsilon i's, are not just mean zero, but are mean zero and normally distributed, so a bell-shaped curve. Some of the other assumptions are hidden in here too, like independence among the error terms and identical distributions with the same variance. But really, what we're adding to the list is this idea of normality, which may or may not be plausible; it depends on the data that you've collected, and we'll look at ways to check normality and some of the other assumptions in a future module in the course.

So first, let's write down the marginal PDF for yi. The PDF for yi will depend on the parameters in the beta vector. The marginal PDF of a normal distribution is 1 over the square root of 2 pi sigma squared, times e raised to minus 1 over 2 sigma squared, times yi minus the mean of yi, squared; I'll use the shorthand mu of yi for that mean. Let's just make a note that this mean is exactly the linear term: beta zero plus beta one xi1, all the way up to beta p xip. Now, if we want the joint PDF, well, because we have independence, we can multiply the marginal PDFs together. The joint will be a function of all of the data, so I'll write it with the y vector and also our beta vector. If we multiply our terms together, we'll have n of these terms, so we can write this as 2 pi sigma squared to the negative n over 2. That takes care of the fact that we're multiplying together n terms that each have a square root in the denominator. And then, if we multiply a bunch of exponentials together, we can sum the terms in the exponent. So we'll have a constant out front, 1 over 2 times sigma squared, and then the sum from i equals 1 up to n of yi minus mu of yi, squared. And again, this mu of yi term is exactly that linear equation.

Okay, so maximum likelihood estimation, hopefully you remember, says to maximize the likelihood function, where the likelihood function has the same form as the joint PDF, but it's treated as a function of the parameters beta instead of as a function of the data.
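To keep the bookkeeping straight, here are the two densities just described written out, with mu i as shorthand for the linear predictor, following the video's notation:

    f(y_i \mid \beta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2}\,(y_i - \mu_i)^2 \right), \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}

    f(\mathbf{y} \mid \beta) = \prod_{i=1}^{n} f(y_i \mid \beta) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2 \right)

The second line uses the independence assumption: the joint PDF is the product of the marginals, which is where the n over 2 power and the sum in the exponent come from.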
So the data in the likelihood function are fixed, the parameters are free to vary, and the maximum likelihood estimator chooses the values of the parameters that maximize the likelihood function. So, really, we've already written down the likelihood function, as long as we think of this whole expression as a function of the beta values, which are again hidden in the mu terms. We'll skip the step of rewriting it as a function of the betas, and instead write down the log-likelihood function, which is often what we maximize instead of the likelihood function itself. The reason is that the maximizer, the value you plug in to get the maximum, is the same for the likelihood and the log-likelihood, and it's often easier to maximize the log-likelihood.

So the log-likelihood, which we'll denote with a script l, will be a function of beta. If we take the log of the joint PDF, we have the log of a product, so we can take the sum of the logs of each factor; those are just basic log laws. And for the first factor, another log law says we can bring the exponent down in front of the log. Doing a few steps at once here, we have negative n over 2 times the log of 2 pi sigma squared, and then we add to that the log of e raised to all of that stuff in the exponent. Well, log and e are inverse functions, so what we end up with is just minus 1 over 2 sigma squared times that sum; the log and the exponential effectively cancel out, and we're left with what was in the exponent.

For maximum likelihood estimation, we maximize the log-likelihood; basically, we find the values of beta that maximize the log-likelihood function. Think about how this will work. If we maximize this function with respect to beta, there are no betas in the first term, the negative n over 2 times the log of 2 pi sigma squared, so the maximization procedure won't take that term into account; to find the maximizers, we won't need to deal with it. So basically we can forget about that term and just deal with the second one. Now, what is that second term? Well, the constant out front, 1 over 2 sigma squared, doesn't depend on beta, so if we're just looking to maximize in terms of the betas, it doesn't matter and we can drop it. Then we're left with the negative of the sum of the squared residuals. And really, that sum is what we defined to be the residual sum of squares in the last video. So maximum likelihood says that we should maximize the negative residual sum of squares. Well, that's exactly the same as minimizing the positive residual sum of squares. So that shows us that least squares is the same as maximum likelihood under the assumption of normally distributed, independent errors.
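Written out, the derivation just walked through is the following; the last line uses the fact that dropping the leading term and the positive factor 1 over 2 sigma squared does not change which beta does the maximizing:

    \ell(\beta) = \log f(\mathbf{y} \mid \beta) = -\frac{n}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2

    \hat{\beta}_{\text{MLE}} = \arg\max_{\beta}\, \ell(\beta) = \arg\min_{\beta}\, \sum_{i=1}^{n}(y_i - \mu_i)^2 = \arg\min_{\beta}\, \text{RSS}(\beta) = \hat{\beta}_{\text{LS}}

So the maximum likelihood estimator and the least squares estimator are the same beta vector under the normality assumption.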