In this video, we'll describe scenarios in which prediction using linear regression, or statistical models more broadly, can go wrong. Linear regression is a really powerful tool, but having the wrong model can lead to bad predictions. For example, regression may fail to capture a really complicated relationship; that's one possible way to have the wrong model. So one lesson is to make sure that you have a model that is sufficiently complicated, so that you are making reasonable predictions. But there's a flip side to this coin, which is that having a model that is overly complicated can also be bad for making predictions. This fact stems from the trade-off between the systematic relationship that we see in our data and random error. For example, in a simple linear regression, you see a systematic relationship between the response y and a predictor x, but there is also a random error term: some variability that's not incorporated in the systematic relationship. An overly complicated model can attempt to model that random error when it shouldn't, and that's called overfitting. Overfitting is a problem for both prediction and explanation, but for prediction in particular, you'll end up making predictions that are way off, because you're modeling random error instead of just the systematic relationship. To see this, consider the plot here. The plot was created from a simulation where I generated data using the relationship y = 5x + epsilon; that is, a zero intercept and a slope equal to 5. I generated the x values uniformly between -1 and 1. From there, I fit an 11th-degree polynomial to the data; that's the gold curve.
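The simulation just described can be sketched in a few lines of NumPy. This is a minimal sketch, not the instructor's exact code; the sample size, noise level, and random seed are assumptions, since the video doesn't state them.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# True relationship: y = 5x + epsilon (zero intercept, slope 5)
n = 20                                # small sample size (assumed)
x = rng.uniform(-1, 1, size=n)        # x values uniform on [-1, 1]
eps = rng.normal(0, 1, size=n)        # random error term
y = 5 * x + eps

# Fit an 11th-degree polynomial -- far more flexible than the truth
coefs = np.polyfit(x, y, deg=11)

# Compare the fitted curve to the true systematic component, 5x
fitted = np.polyval(coefs, x)
truth = 5 * x
print("max |fitted - truth| at the training points:",
      np.abs(fitted - truth).max())
```

Because the polynomial chases the noise, its fitted values can sit far from the true line 5x even at the training points themselves, which is the wiggling visible in the plot.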
Now, this model clearly overfits the data: it wiggles back and forth between the points, and you can see that the fit is actually really poor in some regions. That's a phenomenon that happens when you fit too high a degree of polynomial to a small amount of data. But the basic lesson that I want you to see here is that this polynomial relationship, while it seems to capture some of the systematic variability, namely the upward trend in y as you move along the x-axis, also captures a lot of random variability. So the predictions will be off: if we added, say, a new point and tried to predict it using this model, we would not be doing so great. A poorly fitting model is one way that predictions can go wrong. Another possibility is what we could call quantitative extrapolation. This occurs when predictions are made at values of the predictor that are, in some sense, very far from the predictor values included in the training dataset. If you train your model on a set of predictor values and then try to use the model to predict the response at predictor values very far from those, there's no reason to think that the relationship that holds in the training predictor space also holds outside of that set of predictor values. So if we actually use a model like this to predict far outside of its predictor space, we're making an additional assumption: that the relationship between the response and the predictors holds far outside of the predictor space in the dataset you trained on. That's an assumption that should be made explicit if you end up using quantitative extrapolation. Typically this should be avoided unless we have really good reasons to think that the relationship extends beyond where we have measurements. So in this slide, you'll see the same data from the previous example on poor model fit.
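The "predictions on new points will be off" lesson can be checked directly by generating fresh data from the same relationship and comparing test error for a simple and an overly flexible fit. Again, this is a hedged sketch: the sample sizes and seed are my assumptions, not values from the video.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

def simulate(n, rng):
    """Draw n points from the true model y = 5x + epsilon."""
    x = rng.uniform(-1, 1, size=n)
    y = 5 * x + rng.normal(0, 1, size=n)
    return x, y

x_train, y_train = simulate(20, rng)    # small training set
x_test, y_test = simulate(200, rng)     # fresh points from the same model

results = {}
for degree in (1, 11):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    preds = np.polyval(coefs, x_test)
    results[degree] = np.mean((preds - y_test) ** 2)  # test MSE
    print(f"degree {degree}: test MSE = {results[degree]:.2f}")
```

In runs like this, the degree-1 fit, which matches the true systematic relationship, typically predicts the new points far better than the degree-11 fit, which has spent its flexibility on the random error in the training set.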
But here I fit just a regular linear regression model, and I've added one single data point that is way outside the interval from -1 to 1 where the rest of the predictor values lie. This new predictor value is at 2, and the response value is somewhere around 3. You'll notice that if you made a prediction at that point based on the model, the gold line here, you would be far over-predicting. This is an example of how, far outside the predictor space, the relationship may be different. If we had more measurements, we might see a downward trend after x = 1, which would mean that the linear model does not predict well outside the interval from -1 to 1, and we would want more data so that we could capture that non-linear relationship. So the moral of the story here is, again, that prediction can be quite bad if we make a bad quantitative extrapolation. Now, predictions can also go wrong in a case called qualitative extrapolation. Qualitative extrapolation occurs when a model is trained on data from one population and then used to make predictions about another population. For example, think about the data related to sales of our product P, which we've worked with several times now. Suppose that a company enters the market for another product, Q. This company has a YouTube marketing budget, a Facebook marketing budget, and a newspaper marketing budget. If we used the model that we trained on data related to the sales of product P to try to predict the sales of product Q for this new company, we'd be making a qualitative extrapolation. Now, we may have good reasons to do that.
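The over-prediction at x = 2 can be illustrated with the same simulated setup: train a simple linear regression on points inside [-1, 1], then extrapolate. The training-set details below are assumptions for illustration, not the exact values behind the slide.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Training data inside [-1, 1], roughly linear with slope 5
x_train = rng.uniform(-1, 1, size=30)
y_train = 5 * x_train + rng.normal(0, 1, size=30)

# Simple linear regression = degree-1 polynomial fit;
# np.polyfit returns coefficients highest degree first
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Extrapolate to x = 2, far outside the training interval
pred_at_2 = slope * 2 + intercept
print(f"prediction at x = 2: {pred_at_2:.2f}")
```

The fitted slope lands near 5, so the prediction at x = 2 comes out close to 10. If the true relationship flattens or turns downward past x = 1, as the observed point near (2, 3) suggests, that prediction badly over-predicts: exactly the quantitative-extrapolation failure described above.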
There may be good reasons to think that the relationship between marketing budgets and sales is the same for product P as it is for product Q, but we have to have those reasons, and they are, again, an additional assumption that we're imposing on this modeling situation. What would be better is if we actually had data on product Q and the marketing budgets for product Q. All right. Those are some places that prediction can go wrong, and those are things that you should keep in mind when you're doing your own data analysis and in your work as a data scientist.