In this video, we'll define some least squares related terms, such as the residuals of the model, the fitted values of a model, and the hat matrix. Then we'll mathematically derive the least squares solution for multiple linear regression. First, let's start out by defining some terms. The first term that we'll define I'll call the residuals. These residuals are actually something that we visualized back in an earlier lesson. Remember, when we talked about least squares, we said that we would be minimizing the sum of the squares of those vertical deviations, and the residuals are those deviations; they are the mathematical form of those deviations. The notation that we'll use is epsilon_i hat. The reason for that is that the residuals can be thought of as something like an estimator of the error terms that show up in the model. If we fit a model well, then the residuals should look something like, although not exactly like, the error terms. What do those deviations look like, and how can we write them mathematically? Really, they were the distance between the measured response and the point on the line, although we haven't yet fixed a line, because we haven't found the least squares solution yet. But we can write down the equation with the betas as variables; we can have many different lines, or planes if we're in higher dimensions with several predictors. We take the data point minus the fitted line or surface, beta_0 plus beta_1 x_i1 plus, up through however many predictors we have, beta_p x_ip. Convince yourself that this difference is the deviation that we talked about and visualized in an earlier video. If we think about this as a function of the betas, then we can choose many different lines or surfaces, and the least squares solution fixes these values at the ones that best fit the data we have, which means taking the sum of the squares of these residuals and minimizing it over the betas. Now, the fitted values for a model are defined in the following way: y_i hat is equal to beta_0 hat plus beta_1 hat x_i1 plus, all the way up through, beta_p hat x_ip. The important thing to note here is that the hats denote estimators; beta_j hat is the least squares estimator, for j starting at 0 and going up through p. Of course, we haven't figured out how to find these yet; that we'll do in a moment. But imagine we have the best values, the least squares solution: if we plug those into the linear equation, we have the fitted values. The fitted values can be used to estimate the mean of the data, and they can also be used to make predictions if we plug in new values of the x's, which we'll talk about in an upcoming module. Finally, let's define the hat matrix. The hat matrix is important for theoretical calculations related to least squares. You might come across some of these, and I think it's worth it to see the definition of the hat matrix. The hat matrix, which we will denote H, is defined as X, our design matrix (the matrix that has the column of 1s and then, in every other column, a predictor in the model), times the inverse of X transpose X, times X transpose; that is, H = X (X^T X)^(-1) X^T. In future lessons, we may work with the hat matrix and see why it's called the hat matrix, but for now I just wanted to define it so that we have it in case we need it.
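To make these definitions concrete, here is a small R sketch on simulated data. The variable names and the simulated model are purely illustrative, and it uses the least squares formula that we only derive later in this video, so treat it as a preview rather than part of the derivation. It builds the fitted values, residuals, and hat matrix directly from the design matrix and compares them with what lm reports.

```r
# Simulated data with two predictors (purely illustrative)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix: a column of 1s, then one column per predictor
X <- cbind(1, x1, x2)

# Hat matrix H = X (X^T X)^{-1} X^T
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Fitted values and residuals built from H
y_hat     <- as.vector(H %*% y)
resid_hat <- y - y_hat

# Compare with what lm computes (should agree up to rounding)
fit <- lm(y ~ x1 + x2)
all.equal(y_hat, unname(fitted(fit)))
all.equal(resid_hat, unname(residuals(fit)))
```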
Now that we've defined some important terms, let's go ahead and derive the least squares estimator for the linear regression model. What we'll do is minimize the sum of the squares of the residuals, which we'll write in terms of matrices and vectors. The sum of the squares of the residuals, sometimes called the residual sum of squares and abbreviated RSS, we can write in the following way: first we take y, the data vector, minus X, the design matrix, times beta, and we take the transpose of that, times y minus X times beta. It's not entirely obvious that this is a sum of squares of residuals, but in fact, if you multiplied these terms out, you would see that it is. Another way to think about it is that this is really the square of the 2-norm of this vector here, the vector of the data minus the mean term. Our goal will be to simplify this residual sum of squares a bit, and once we simplify, we'll take the derivative with respect to beta, set it equal to 0, and get our least squares solution. The first thing we'll do to simplify is take the transpose of the factor on the left. If we have the transpose of a difference, we can just take the transpose of each term, so no swapping of order. But when we take the transpose of the second term, the transpose of a product, we have to take the transpose of each factor and swap the order, so we get beta transpose times X transpose, and then still times y minus X times beta. Our next step will be to multiply these terms out. This should be pretty easy; we just have to be careful to keep the order the same, because we know that matrix multiplication, and matrix-vector multiplication, is not necessarily commutative, and so we should keep the correct order. The first term we get is y transpose times y, this term here times that term. Next we get a minus y transpose times X times beta, that's this term times this term, and we get a minus beta transpose times X transpose times y, that's this term times that term. Then for the last term we have plus beta transpose times X transpose times X times beta. Now let's pause for a second and think about the dimensions here. For this first term, y is an n by 1, so y transpose is a 1 by n; the outer dimensions are one, the inner dimensions match, and so this is actually a scalar, a one-by-one. We know that when we add or subtract terms, we can't add a scalar to some non-scalar vector or matrix, so it should be the case that each one of these other terms is a scalar too. In fact they are; I'll leave it to you to do the dimension analysis on each of the other terms to show that you do get a one-by-one, a scalar. That's important for another reason, not just to know the dimensions of the output: if each one of these middle terms is a scalar, then it turns out that these two middle terms are actually the same. The reason is that if you take the transpose of a scalar, you get the same thing, the scalar itself; the transpose is trivial when you have a one-by-one matrix, a scalar. Think about taking the transpose of the first of those middle terms. You have to take the transpose of each factor and then reverse the order. You'll get a beta transpose as the first factor, an X transpose as the middle factor, and the transpose of y transpose, which is just y, as the last factor. That should show that these two terms are the same.
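To summarize the algebra described so far in one display (this is just the computation above written out in matrix notation):

```latex
\begin{aligned}
\mathrm{RSS}(\beta)
  &= (y - X\beta)^{\top}(y - X\beta)
   = (y^{\top} - \beta^{\top}X^{\top})(y - X\beta) \\
  &= y^{\top}y - y^{\top}X\beta - \beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta.
\end{aligned}
```

Since y transpose X beta is a one-by-one, taking its transpose leaves it unchanged, and that transpose is beta transpose X transpose y, so the two middle terms are equal.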
That means we can simplify this as y transpose y, minus, now let's do 2 times beta transpose X transpose y. Really we could choose either of the two middle forms; I've chosen this one. Then the last term is beta transpose X transpose times X times beta. Think of what we've done. We started with the residual sum of squares, the quantity that we want to minimize, we've done some simplifying, and we've gotten to this point. That's the expression we're going to differentiate with respect to beta and set equal to 0. We'll take the partial derivatives of this residual sum of squares with respect to beta. Notice the first term: that derivative is easy, because it's a constant with respect to beta. No betas show up here, and the derivative of a constant is zero. Now this middle term might look tricky, but actually, what we have is a constant, negative two, which we just copy down, times beta transpose times X transpose y. From a lemma in a previous lesson, we know that the derivative of beta transpose times X transpose times y with respect to beta is just X transpose times y, so we're left with negative two times X transpose times y. Then the derivative of this quadratic form we saw as another lemma in the previous video: that will be equal to 2 times X transpose times X times beta. This is what we're setting equal to 0. Now if we set this equal to zero, we'll move this term over to the other side. Notice that we'll have a two on each side, so the twos cancel. A little bit of simplification gets us that X transpose times X times beta is equal to X transpose times y. We have a matrix times a vector equal to a vector. Now, we can solve this system if the inverse of this matrix exists. It's plausible that an inverse of this matrix exists because it's a square matrix, right? It's a p plus 1 by p plus 1 matrix, and only square matrices can have inverses, so at least it's a candidate for having an inverse. Let's assume for now that the inverse of X transpose X exists. It exists under certain conditions that we'll talk about later on in the course, but right now let's just assume that it exists. Then we can multiply each side on the left by that inverse. We know that the inverse of a matrix times the matrix itself is the identity matrix, which we won't have to write once we perform the multiplication. On the right-hand side we get X transpose X inverse times X transpose times y. Now that we've arrived at our solution, we'll put a hat on top, which in statistics, as I'm sure you know, means that it's an estimator. Parameters we typically denote with Greek letters like beta, and if we put a hat on top of a Greek letter, that typically denotes that we have an estimator, which we can calculate from sample quantities. Everything here we can calculate from collecting a sample: the response variable y and the x's. This is our least squares solution. Really, this holds when X transpose X inverse exists, and we'll talk about those conditions later on. When you're working with real data in R, or if you end up using Python or some other programming language that has a built-in function for linear regression, that function will compute, among other things, the least squares estimator. The lm function in R will give you the least squares estimator for your betas.
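As a quick sanity check of this formula (again on simulated data, so the names here are just illustrative), you can compute X transpose X inverse times X transpose times y directly in R and compare it with the coefficients returned by lm:

```r
# Simulated data with two predictors (purely illustrative)
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 1.5 * x1 - 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)   # design matrix with a column of 1s

# Least squares solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

as.vector(beta_hat)     # closed-form estimate
coef(lm(y ~ x1 + x2))   # lm's estimate (computed via a QR factorization)
```

The two should match up to numerical precision, even though lm does not invert X transpose X explicitly.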
Now, it doesn't compute it exactly in this way; R does not use this exact formula. The reason is that inverting a matrix can be costly and the procedure can be sensitive to small perturbations, so R uses a numerical algorithm to come up with the ordinary least squares estimator. We won't worry about the details of what R actually does, but it will be important for us to know the form of this least squares solution, because we will study properties of this solution. If you are interested in what R actually does, you can investigate the QR factorization. It's a matrix factorization, and it gets at what R is doing in the lm function. Now, least squares makes some assumptions, and I wanted to spell out those assumptions explicitly here, so that we know that if we have violated one of them, the least squares solution won't be best; it won't be the best solution in a sense that we'll define soon. First, we assume that the error terms in the model are mean zero, and that's for every error term in the model; that's an important assumption. There's no shift in the error term; it's centered at zero. The second assumption is really just stating the linearity property. Basically, we're saying that the expected value of the response, before we collect the data, when we're treating it as a random variable, is equal to the linear form. I've written it in a way that I hope we can understand: this here is just a vector containing a 1 and then x_i1 through x_ip, which is just the ith row of the design matrix. Really, all this says is that the expected value of y_i is equal to beta_0 plus beta_1 x_i1 up through beta_p x_ip, and that should be true for all i from 1 through n. Now, the third assumption has lots of information packed into it, so let's try to unpack it. It says that the covariance between error terms is zero, meaning they're uncorrelated with each other, for i not equal to j: if you have two separate error terms, the covariance is zero, no correlation between errors. It also says that if i is equal to j, well, the covariance of, say, epsilon_i with epsilon_i is just the variance of that error term, and that variance is constant: whenever you have the covariance of something with itself, namely the variance, you get the same constant term, which does not vary with whatever index you happen to have, and again, that's for all i from 1 through n. The final condition that we need for the least squares solution to hold is that the X transpose X inverse matrix exists. On the last slide, we saw that we couldn't have performed that last computation without the existence of this inverse, so we need to assume that it exists, and we'll talk about situations in which it might not exist.
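Written compactly, the four assumptions just described are as follows (here sigma squared is my notation for the common error variance, and x_i transpose = (1, x_i1, ..., x_ip) is the ith row of the design matrix):

```latex
\begin{aligned}
&\text{(1)}\quad E[\varepsilon_i] = 0, \qquad i = 1, \dots, n;\\[2pt]
&\text{(2)}\quad E[y_i] = x_i^{\top}\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad i = 1, \dots, n;\\[2pt]
&\text{(3)}\quad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) =
   \begin{cases} \sigma^2, & i = j\\ 0, & i \neq j \end{cases}
   \qquad i, j = 1, \dots, n;\\[2pt]
&\text{(4)}\quad (X^{\top}X)^{-1} \text{ exists.}
\end{aligned}
```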