In this video, we'll describe how to make predictions using the linear regression model, and we'll also discuss a metric for comparing predictive linear regression models. Let's recall that a predictive model is a statistical model used to provide a value or, as we'll see, potentially a range of values of the response based on values of the predictors that were not used to train the model. That part is important: not used to train the model.

We can use regression models for prediction, and we'll do so by supposing that we have a model like this one here. It's worth unpacking this a little bit and noting that in this situation, when I say we have a model, what I really mean is that we have the linear form, we think that the assumptions are met for the data that we trained it on and for the data that we'll use to make predictions on, and we've fit the model and found this estimator for our parameters. In this equation, the important part of the model is the linear form and this set of estimators. This X is our design matrix; it contains the data that we used to train the model, the set of predictors, and this here is the vector of response measurements used to train the model. Everything here you could think of as data used to train, or fit, your model.

Now when we make predictions, we will take the estimates of the parameters and use them on new data. We imagine that we measure some new predictors and we get a matrix where we have a column of ones, then, say, a new measurement of the first predictor, and so on out to a new measurement of the p'th predictor. I'm using stars to denote that these are new measurements, measurements that were not used to train the model. Then potentially we have several new sets of predictor values, all the way down through maybe an x_k1 star, a k'th measurement of the first predictor, through an x_kp star. Each row gives a new measurement of the first predictor through a new measurement of the p'th predictor, and we might have, say, k of those rows. These are things that we've measured, and the goal is to use the model that we fit, together with the new values of the predictors, to come up with a predicted value of the response. The new predictor values are measured; the corresponding values of the response, which we'll call y_1 star through y_k star, are predicted, not measured. We use the model to come up with a prediction. We could call this a point estimate of our predicted value, or just our predicted value. We should note that in a previous lesson we saw that this value is also the point estimate for the average response. We discussed that in the context of confidence intervals for the average response, but what we'll see is that prediction intervals are different from confidence intervals, and we'll cover that in the next lesson.

If you're interested in making a prediction from your regression model, then once you've fit the model and you have a new set of predictor values, you can compute the right-hand side of this equation to get the left-hand side, and the left-hand side stands in as your prediction for what the response would be at each of those values of the predictors. Just in case you want to see this not for several new values of the predictors but for just one: you can think of this here as just a row vector, the vector 1, x_i,1 star, all the way out through x_i,p star, so instead of having several new sets of predictor values, you just have one.
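Written out in symbols, the setup being described looks roughly like this, with stars marking the new measurements and hats marking predicted rather than measured quantities:

$$
\hat{y}^{*} = X^{*}\hat{\beta},
\qquad
X^{*} =
\begin{pmatrix}
1 & x_{11}^{*} & \cdots & x_{1p}^{*} \\
\vdots & \vdots & & \vdots \\
1 & x_{k1}^{*} & \cdots & x_{kp}^{*}
\end{pmatrix},
\qquad
\hat{y}^{*} =
\begin{pmatrix}
\hat{y}_{1}^{*} \\ \vdots \\ \hat{y}_{k}^{*}
\end{pmatrix}
$$

For a single new set of predictor values, the corresponding row is the vector $x_i^{*} = (1, x_{i1}^{*}, \ldots, x_{ip}^{*})$, and the prediction is $\hat{y}_i^{*} = x_i^{*}\hat{\beta}$.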
This is what the equation would look like, and of course, this here is just Beta naught hat, plus Beta_1 hat times x_i,1 star, and so on through Beta_p hat times x_i,p star.

Now that we've defined a point estimate for prediction, we can introduce an important model metric for predictive models, namely the mean squared prediction error. That's what's being shown here, so let's try to analyze this quantity. What we're doing is looking at deviations, squaring those deviations, summing over the number of those deviations, and then dividing by k. It looks like a sample mean of squared deviations. Now what are those deviations of? Well, they are of the true value of the response given a new set of predictors, minus what the model would predict at that new set of predictors. The second line here is just showing you exactly what this y_i* hat looks like: it's the new value of the predictors times the model that you fit on the data in your original dataset.

I think it's important to analyze this quantity in terms of what is and what is not an observable quantity in the context of prediction. This term here, x_i* times Beta hat, is observable. The x_i* term is a vector of predictor values, and those are the values at which we're interested in predicting the response. Technically, this is a row vector: the first entry is a one, to correspond to the intercept term here, and every subsequent entry is a measurement of the j'th predictor, for j equal to 1 through p. Beta hat was calculated from the data used to fit the model, so this we have based on the fitting of the model. Then y_i* is not observable, since it's the quantity that we want to predict. Either this quantity hasn't happened yet, or we couldn't observe it; it wasn't in our original dataset. If we had this value, we wouldn't need to do prediction; we would already know the answer.

Basically, if we use all of the data in fitting, so in coming up with our Beta hat, then the mean squared prediction error can only be computed after the value being predicted is measured. If we use all the data to get our Betas, then we won't have any of these left over, and we can compute this quantity only after those data come in, only after we collect them. That means we can only assess the predictive model using the mean squared prediction error after the prediction comes in, which is not all that helpful. This fact is really why statisticians and data scientists will split their dataset into a training set and a testing set. The training set is usually somewhere around 80 percent of the data in the original dataset; that 80 percent goes into fitting, or training, the model, and the remaining 20 percent or so is used as a testing set, for deciding whether the model actually works well in predicting new values. In these cases where we have a training set and a testing set, Beta hat is the least squares estimator fit to the training data, x_i* is a vector of predictor values in the testing data, and y_i* is the value of the response corresponding to the predictors x_i*, also from the testing data.
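In symbols, the quantity just described is roughly

$$
\text{MSPE} = \frac{1}{k}\sum_{i=1}^{k}\left(y_i^{*} - \hat{y}_i^{*}\right)^{2}
            = \frac{1}{k}\sum_{i=1}^{k}\left(y_i^{*} - x_i^{*}\hat{\beta}\right)^{2}.
$$

As a rough sketch of how the train/test version of this computation might look in R, assuming a hypothetical data frame df with a response y and predictors x1 and x2 (not data from the course):

```r
# Minimal sketch: split a hypothetical data frame `df` into training and testing sets
set.seed(1)
n <- nrow(df)
train_rows <- sample(n, size = floor(0.8 * n))   # roughly 80% of rows for training
train <- df[train_rows, ]
test  <- df[-train_rows, ]                       # remaining ~20% held out for testing

fit  <- lm(y ~ x1 + x2, data = train)            # Beta hat comes from the training data only
pred <- predict(fit, newdata = test)             # y_i* hat = x_i* Beta hat for each testing row

mspe <- mean((test$y - pred)^2)                  # mean squared prediction error on the testing set
mspe
```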
Note that the mean squared prediction error may also be used for comparing predictive models. If we have, say, two or more models, we could compute the mean squared prediction error on the testing set for each of our models, and then choose the one with the lowest value for the mean squared prediction error. In a future lesson, we'll do this in R.
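Just as a rough preview of that comparison, reusing the hypothetical train and test split from the sketch above (again, not the course's own example):

```r
# Two hypothetical candidate models, both fit only to the training data
fit1 <- lm(y ~ x1,      data = train)
fit2 <- lm(y ~ x1 + x2, data = train)

# Mean squared prediction error of each model on the same testing set
mspe1 <- mean((test$y - predict(fit1, newdata = test))^2)
mspe2 <- mean((test$y - predict(fit2, newdata = test))^2)

c(model1 = mspe1, model2 = mspe2)   # prefer the model with the lower MSPE
```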