Let's cover some basic examples of multivariable linear regression models. We've already covered a few, and perhaps the simplest is where we want to minimize y minus a constant vector times an intercept, and we saw that the minimizer of this led to beta 0 hat equal to y bar. Of course, we can get that from the multivariable regression formula by looking at the solution, beta 0 hat = (Jn transpose Jn) inverse Jn transpose y. The term Jn transpose Jn is just the inner product of a vector of 1s with itself. 1 times 1 is just 1, so we're adding up a bunch of 1s, and taking the inverse of that gives 1 over n. And Jn transpose y is just a bunch of 1s times y, which adds up the elements of y. So beta 0 hat is 1 over n times the sum of the elements of y, or y bar. The second case that we looked at is the instance where we have y minus x beta squared, where x is also a vector, and we want to minimize that least squares criterion. This is so-called regression to the origin. (Of course, regression to the mean is a special case of it, where x is just a vector of 1s.) And we saw already that the result was beta hat equal to the inner product of x and y over the inner product of x with itself. Well, let's just show that this agrees as a special case of multivariable regression. We have beta hat = (x transpose x) inverse x transpose y. Well, x transpose x is just the inner product of x with itself in the event that x is a vector, and x transpose y is just the definition of the inner product of x and y when x is a vector. Okay, so two of our special cases are direct consequences of multivariable regression. The next one I'm going to ask you to do on your own, because we've worked on it a lot and shown it in different ways: the case where we want to do linear regression, beta 0 times Jn plus beta 1 times the vector x, so that's our design matrix in this case.
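These two special cases are easy to check numerically. Here's a quick sketch (the data are simulated by me, not from the lecture), applying the general least squares formula to each:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
y = rng.normal(5, 2, n)

# Case 1: regress y on a column of 1s; the multivariable formula
# (Jn' Jn)^{-1} Jn' y should give the sample mean y-bar.
Jn = np.ones((n, 1))
beta0_hat = np.linalg.solve(Jn.T @ Jn, Jn.T @ y)[0]
assert np.isclose(beta0_hat, y.mean())

# Case 2: regression to the origin; (x'x)^{-1} x'y is <x, y> / <x, x>.
x = rng.normal(0, 1, n)
beta_hat = (x @ y) / (x @ x)
X = x.reshape(-1, 1)
beta_hat_mv = np.linalg.solve(X.T @ X, X.T @ y)[0]
assert np.isclose(beta_hat, beta_hat_mv)
```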
Let me just call this trying to minimize y minus W beta squared, where W is the matrix whose columns are the vector of 1s and the vector x; I should probably just write it out like this, Jn and the vector x. So what I'd like you to try to show is that beta hat, where beta here I'm again defining as (beta 0, beta 1), which is (W transpose W) inverse W transpose y, works out to be the standard definitions of the linear regression intercept and slope. W transpose W winds up being a 2 by 2 matrix, so you can actually calculate the inverse; just look up the inverse of a 2 by 2. And this is fairly straightforward to do as well; the result winds up being a 2 by 1 vector. So it's a little bit tedious in terms of bookkeeping, but you can show that direct use of the multivariable least squares solution winds up with the same result in the three cases that we've spent a lot of time talking about so far. Let's consider another important special case, where your model is applied in a pretty general setting. So imagine our y looks something like this: y11 up to y1,n/2, and y21 up to y2,n/2. In other words, our y is really a stack of two vectors, say y1 and y2, where the first comes from one group and the second comes from another group. So we might think of a setting where we're plotting y and we have group 1 and group 2, something like a box plot. Some instance like that, where you're interested in modeling the fact that there's two groups using least squares. We could do this by minimizing the least squares criterion for y minus X beta, where X has a bunch of 1s on top of a bunch of 0s in the first column, and a bunch of 0s on top of a bunch of 1s in the second column. So X has n/2 1s in the first column and n/2 1s in the second column. Now let's work out what beta hat works out to be in this case: beta hat is of course exactly equal to (X transpose X) inverse X transpose y.
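Stepping back for a moment to the exercise just mentioned: the claim that the matrix solution matches the familiar intercept and slope formulas is quick to verify numerically. A sketch with simulated data (my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(10, 3, n)
y = 1.5 + 0.7 * x + rng.normal(0, 1, n)

# Design matrix W = [Jn, x]; solve the normal equations directly.
W = np.column_stack([np.ones(n), x])
beta0_hat, beta1_hat = np.linalg.solve(W.T @ W, W.T @ y)

# Standard simple-linear-regression formulas.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
assert np.isclose(beta1_hat, slope)
assert np.isclose(beta0_hat, intercept)
```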
Now, X transpose X: that's the matrix whose rows are a vector of 1s and 0s and a vector of 0s and 1s, times the matrix with those same vectors as columns, and we want that inverse. And then our next component is X transpose y, which is that same matrix of 1s and 0s and 0s and 1s times y. Okay, so looking at X transpose X: the upper left entry is going to be n/2, because when I multiply the first row by the first column, it just counts up the number in the first group. When I multiply the first row by the second column, it's just 0, and the same for the other off-diagonal entry; and the lower right entry is also n/2. And there's nothing in particular about having equal numbers in the two groups; they could have been n1 and n2 there. I just used n/2 for the balanced case with equal numbers in both groups. Now let's look at X transpose y. The first entry is going to be the sum of the first group, so let's just call that Jn/2 transpose times y1, and the second entry is Jn/2 transpose times y2. The inverse of X transpose X is pretty easy, because it's a diagonal matrix, so it's just 1 over each of the diagonal entries, 2 over n. And so what we get is that y1 bar and y2 bar are the coefficient estimates in beta hat. Which is what we imagined should happen: if we have one effect for group 1 and a second effect for group 2, the least squares estimates would have to turn out to be the average for group 1 and the average for group 2. So the fitted values in this case are Jn/2 times y1 bar if you're in group 1, and Jn/2 times y2 bar if you're in group 2. Now remember, last time we were considering a setting where we wanted to minimize y minus X1 beta, where X1 has Jn1 on top of a bunch of 0s in the first column, and a bunch of 0s on top of Jn2 in the second column, where J is, again, a vector of 1s. Let's say X1 is like that.
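This two-group calculation is easy to verify numerically. A sketch (group sizes and data are made up by me) using unequal groups, since nothing depends on balance:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 6, 4
y1 = rng.normal(3, 1, n1)
y2 = rng.normal(7, 1, n2)
y = np.concatenate([y1, y2])

# One indicator column per group: Jn1 over 0s, and 0s over Jn2.
X = np.zeros((n1 + n2, 2))
X[:n1, 0] = 1
X[n1:, 1] = 1

# X'X is diagonal, diag(n1, n2), so inverting it is trivial.
XtX = X.T @ X
assert np.allclose(XtX, np.diag([n1, n2]))

# The coefficients come out as the two group means.
beta_hat = np.linalg.solve(XtX, X.T @ y)
assert np.isclose(beta_hat[0], y1.mean())
assert np.isclose(beta_hat[1], y2.mean())
```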
So we have two groups of data, and we found that our estimate, beta hat, works out to be (y1 bar, y2 bar), where our beta in this case is (beta 1, beta 2). Now consider minimizing y minus X2 gamma, where X2 has as its first column the vector Jn1+n2, and as its second column Jn1 on top of a vector of 0s; let me just write that as 0n2, meaning a vector of 0s of length n2. (Similarly, the 0s in X1's first column are 0n2, and in its second column 0n1.) And gamma is equal to (gamma 1, gamma 2). Now notice that if I add X1's two columns, I get X2's first column. And X1's first column is exactly X2's second column. So what we see is that X1 and X2 have an identical column space. And what we know from our projection argument is that the fitted values from the two models have to be the same. Well, the fitted values from model 1: for any observation in group 1, the fitted value is going to be y1 bar, and for any observation in group 2, it's going to be y2 bar. We know that because beta 1 hat equals y1 bar and beta 2 hat equals y2 bar; we worked it out in the last example. Okay, now look at X2 times gamma hat. The fitted value for anyone in group 1 is going to be gamma 1 hat plus gamma 2 hat, and for anyone in group 2, it's got to just be gamma 1 hat by itself. So we know the fitted values have to satisfy these equations, and they have to agree, because the column spaces of the two designs are the same. So what we know, then, is that beta 1 hat, which is y1 bar, has to equal gamma 1 hat plus gamma 2 hat, and beta 2 hat, which is y2 bar, has to equal gamma 1 hat. We can use that to solve for gamma 1 hat and gamma 2 hat without actually having to go to the trouble of inverting this matrix.
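Here's a small numerical confirmation of this column-space argument (the data and group sizes are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 5, 7
y = np.concatenate([rng.normal(2, 1, n1), rng.normal(5, 1, n2)])

# X1: one indicator column per group.
X1 = np.zeros((n1 + n2, 2))
X1[:n1, 0] = 1
X1[n1:, 1] = 1
# X2: overall intercept column plus a group-1 indicator column.
X2 = np.column_stack([np.ones(n1 + n2),
                      np.r_[np.ones(n1), np.zeros(n2)]])

beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]
gamma_hat = np.linalg.lstsq(X2, y, rcond=None)[0]

# Identical column spaces => identical fitted values.
assert np.allclose(X1 @ beta_hat, X2 @ gamma_hat)
# Conversion without inverting anything:
# gamma1 = beta2, gamma2 = beta1 - beta2.
assert np.isclose(gamma_hat[0], beta_hat[1])
assert np.isclose(gamma_hat[1], beta_hat[0] - beta_hat[1])
```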
Now, it's a 2 by 2 matrix, so it shouldn't be that hard to invert. But suppose you had a somewhat harder setting; say we had ten columns, then it would be a good deal harder to invert. This is a common trick in these ANOVA-type examples: you can reparameterize to the easy case, where you get a bunch of block-diagonal 1 vectors, as with X1, in which case X transpose X works out to be a diagonal matrix that's very easy to invert. And then if you want a different parameterization, one that would result in an X transpose X that's hard to invert, you can use the fact that the fitted values have to be identical to convert between the parameters after the fact. So in this case, you know that gamma 1 hat has to equal beta 2 hat, and then, plugging into those two equations, gamma 2 hat has to equal beta 1 hat minus beta 2 hat. Okay, and so that gives you a very quick way to go between parameters in equivalent linear models with different specifications, just different organizations of the same model. So it's a useful trick when you're working with these ANOVA-type models. Now let's discuss what I think of as one of the most important examples in regression. Imagine that my y breaks down into two vectors, a group 1 and a group 2, and my design matrix, which I'm going to call W, for reasons that will become clear later, is equal to a matrix called Z next to a vector called x, where x is n by 1 and Z is n by 2. And Z looks like this: its first column is Jn1 on top of an n2 vector of 0s, and its second column is an n1 vector of 0s on top of Jn2. So the Z matrix looks like the two-group ANOVA matrix from previously, but we've appended an x vector onto it as well. So in this example, if we do least squares with this W, we are interested in fitting a model where we have our regression line, but with separate intercepts for the two groups, right?
So the coefficient in front of x is the common slope, and the coefficient in front of each Z column is the intercept for each of the groups. Okay. So we want to minimize y minus W gamma, quantity squared; I'm not going to call the coefficient vector beta, let me call it gamma, where gamma is equal to (mu 1, mu 2, beta): mu 1 the intercept for group 1, mu 2 the intercept for group 2, and beta the common slope across the two groups. So we can write this as y minus x beta minus Z mu, where mu is the vector (mu 1, mu 2). Okay, so we can write it out like that. And then let's figure out what this works out to be. Let's use our standard trick where we hold beta fixed and come up with the estimate for mu as it depends on beta. Well, if beta is held fixed, then y minus x beta is just a vector, and this is just the two-group ANOVA problem that we discussed previously. Remember that the solution of that problem worked out to be the mean in group 1 and the mean in group 2. So the estimate for mu 1, as it depends on beta, has to be the mean of the group 1 part of that vector: that's y1 bar, the group 1 mean of the y's, minus x1 bar times beta. And mu 2, the mean for group 2 as it depends on beta, has to be y2 bar minus x2 bar times beta. Now, if I plug those back in for mu 1 and mu 2 and subtract them off from y, what I get is nothing other than the group-centered version of y: the vector (y1 minus y1 bar Jn1, y2 minus y2 bar Jn2), minus the vector (x1 minus x1 bar Jn1, x2 minus x2 bar Jn2) times beta. I didn't define x1 and x2, but let me just say those are the group components of x: x1 is the first n1 measurements of x, and x2 is the latter n2 measurements of x. Okay? So, oops, I shouldn't say the criterion is equal here; it has to be greater than or equal, because we've plugged in the optimal estimates of mu 1 and mu 2 for a fixed beta.
Well, this is now nothing other than regression to the origin with the group-centered version of y and the group-centered version of x. So we know that the best beta hat we can get has to work out to be the regression to the origin estimate from this setup. So that's just a summation; probably the easiest way to write it out first is the inner product of y tilde and x tilde over the inner product of x tilde with itself, where y tilde is the group-centered version of y and x tilde is the group-centered version of x. In other words, by group-centered I mean each observation has its group mean subtracted off. And you can show, and I have this in the notes, okay, well, let's just do it really quickly here. What does this work out to be? It works out to be the double sum over i and j, with i = 1 to 2 and j = 1 to n sub i, of (yij minus y bar i)(xij minus x bar i), all over the double sum of (xij minus x bar i) squared. And let me just explain my notation here: yij is the jth component of group i, so y11 is the first component of the vector y1, y12 is the second component of y1, y21 is the first component of y2, and so on. So we can write this out, and I think this is probably the nicest way to write it out, as p times beta 1 hat plus (1 minus p) times beta 2 hat, where beta 1 hat is the regression estimate for only group 1, if you only had the centered x1 data and the centered y1 data, and beta 2 hat is the regression estimate if you only had the centered y2 data and the centered x2 data. Okay, so it is interesting to note that the ANCOVA slope works out to be a weighted average of the individual group-specific slopes, where in this case p works out to be the summation of (x1j minus x1 bar), over the sum.
Oh, and those terms should have been squared, sorry about that: p is the sum of (x1j minus x1 bar) squared over the double sum of (xij minus x bar i) squared. So p works out to be the proportion of the total variation in the x's that comes from group 1. So if most of the variation in your x's is in group 1, then the group 1 slope contributes more to the overall ANCOVA slope; if group 2 is more variable, then group 2 contributes more; and if they're equally variable, then both of them contribute equally. Okay. So let's go back: once we have our beta hat, we can figure out what our mu 1 hat and our mu 2 hat are. So mu 1 hat is equal to y1 bar minus x1 bar times beta hat, and mu 2 hat is equal to y2 bar minus x2 bar times beta hat. Okay, so the difference in the intercepts, mu 1 hat minus mu 2 hat, works out to be (y1 bar minus y2 bar) minus (x1 bar minus x2 bar) times beta hat. Now, one way to think about this: the most common way to think about ANCOVA is the instance where you want to compare treatments, treatment 1 versus treatment 2, but you have some confounding factor that you need to adjust for. Say, for example, you're looking at a weight loss treatment, and your confounding factor is the initial weight of the person. Okay, and so if the initial starting weight of the people receiving the one weight loss treatment is different from the initial weight of those receiving the other weight loss treatment, then you'd be worried about just directly comparing the two means. Well, this shows what, in addition to the two means, you need to subtract off if you model the data as an ANCOVA model. Most interestingly, if you randomize, and your randomization is successful in the sense of balancing this observed covariate, the baseline weight, then the group 1 average of the covariate should be pretty close to the group 2 average.
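Before going on, here's a sketch pulling the ANCOVA pieces together numerically (the sample sizes, slopes, and noise levels are my own choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2 = 30, 50
x1 = rng.normal(0, 2, n1)
x2 = rng.normal(1, 1, n2)
y1 = 2 + 1.5 * x1 + rng.normal(0, 0.5, n1)
y2 = 4 + 1.5 * x2 + rng.normal(0, 0.5, n2)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])

# Fit W = [Z, x] directly: two intercept columns plus the common slope.
Z = np.zeros((n1 + n2, 2))
Z[:n1, 0] = 1
Z[n1:, 1] = 1
W = np.column_stack([Z, x])
mu1_hat, mu2_hat, beta_hat = np.linalg.lstsq(W, y, rcond=None)[0]

# 1) beta hat is regression to the origin on the group-centered data.
x_t = np.concatenate([x1 - x1.mean(), x2 - x2.mean()])
y_t = np.concatenate([y1 - y1.mean(), y2 - y2.mean()])
assert np.isclose(beta_hat, (x_t @ y_t) / (x_t @ x_t))

# 2) beta hat is the weighted average p*b1 + (1-p)*b2 of the group
#    slopes, weighted by each group's share of the x variation.
def slope(xg, yg):
    xc, yc = xg - xg.mean(), yg - yg.mean()
    return (xc @ yc) / (xc @ xc)

ss1 = np.sum((x1 - x1.mean()) ** 2)
ss2 = np.sum((x2 - x2.mean()) ** 2)
p = ss1 / (ss1 + ss2)
assert np.isclose(beta_hat, p * slope(x1, y1) + (1 - p) * slope(x2, y2))

# 3) The intercepts are the adjusted group means.
assert np.isclose(mu1_hat, y1.mean() - x1.mean() * beta_hat)
assert np.isclose(mu2_hat, y2.mean() - x2.mean() * beta_hat)
```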
So this difference in the covariate means should be quite small, so that whether you adjust for baseline weight in the model or just do a straight two-group ANOVA, the estimates should be very similar. On the other hand, if you happened not to have randomized, and you have imbalance, so that the covariate average for group 1 is very different from the average for group 2, then the difference between the unadjusted estimate and the adjusted estimate can be quite large. Okay, so that's ANCOVA; that's an important example. I have some more written about it in the notes, but I think you can actually learn a lot about regression and adjustment just by thinking about this one example.
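As a final illustration of adjusted versus unadjusted comparisons, here's a simulation of the imbalanced case (all numbers are my own; there is no true treatment effect, only a covariate imbalance):

```python
import numpy as np

rng = np.random.default_rng(8)
n1, n2 = 40, 40
x1 = rng.normal(0, 1, n1)   # baseline covariate, group 1
x2 = rng.normal(2, 1, n2)   # imbalanced: group 2 starts higher
y1 = 5 + 1.0 * x1 + rng.normal(0, 0.5, n1)
y2 = 5 + 1.0 * x2 + rng.normal(0, 0.5, n2)

# Common slope from the group-centered regression to the origin.
x_t = np.concatenate([x1 - x1.mean(), x2 - x2.mean()])
y_t = np.concatenate([y1 - y1.mean(), y2 - y2.mean()])
beta_hat = (x_t @ y_t) / (x_t @ x_t)

# Unadjusted difference in means vs. the ANCOVA-adjusted difference
# mu1_hat - mu2_hat = (y1bar - y2bar) - (x1bar - x2bar) * beta_hat.
unadjusted = y1.mean() - y2.mean()
adjusted = unadjusted - (x1.mean() - x2.mean()) * beta_hat

# With no true effect but an imbalanced covariate, the unadjusted
# difference is misleadingly large while the adjusted one is near 0.
assert abs(adjusted) < abs(unadjusted)
assert abs(adjusted) < 0.5
```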