So in this lecture, we're going to try to explain the relationship between training, validation, and test performance. In the previous lecture, we covered exactly what the roles of the training, validation, and test sets are; here we're trying to characterize the relationships between a model's performance on those datasets. That will help us describe how these theorems can actually be used for model selection when we go to implement it in code later on.

To recap what we previously saw, we showed this relationship between the training, validation, and test sets, and we described how we can use the validation set to select model hyperparameters. The question is, in practice, how do we take these training, validation, and test errors, see how they relate to each other, and use that to guide us toward selecting the best model?

Again recapping, our basic setup consists of the following: we have a set of hyperparameters; in this case it might just be the lambda value, which trades off model complexity against model accuracy, and we're trying to select the best value of it. For each value of those hyperparameters, we get one model. So for each value of lambda, we get one value of theta from our training set, and we'll then use our training, validation, and test sets to evaluate that model's performance.

So our theorems are the following. First of all, our error should increase as lambda increases. What does that really mean? We have this plot here whose x-axis shows the model complexity, in other words the value of lambda, which trades off complexity versus accuracy. On the right-hand side of that plot we have large values of lambda, meaning we're penalizing complexity more; in other words, we will have a less complex model. On the left-hand side of that plot we're penalizing complexity less; in other words, we'll have a more complex model. In the extreme case where lambda is equal to zero, we would not be penalizing complexity at all; all we would be doing is minimizing our error, or RMSE, or something like that. So as you push the trade-off between accuracy and complexity in favor of accuracy, that's the left-hand side, you'll have lower training error; as you push it in favor of less complexity, you'll have simpler models and higher training error.

Typically, if you've implemented your regularizer well, then as the regularization strength (the value of lambda) increases, the model should gradually start to behave like a trivial model. For example, if we had a model with an offset term and a bunch of parameters capturing the effects of different features, and we regularized only those feature parameters, then under extreme regularization the model would be doing nothing but predicting a constant all the time. A naive model that always predicts a constant does best by predicting the mean, in which case its mean squared error is equal to the variance of the labels, as we showed in a previous lecture. So as your model becomes simpler and simpler, until it's really doing nothing but predicting a constant, its error will gradually asymptote towards the performance of this trivial model.

The second theorem says that your validation and test errors should be larger than the training error. This should not come as too much of a surprise, and it's a good way to sanity-check your model's performance.
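As a quick check of that claim about the trivial model, here is a minimal sketch, using numpy and a made-up label vector y (purely illustrative, not from the lecture), showing that always predicting the mean gives a mean squared error equal to the variance of the labels:

```python
import numpy as np

# Made-up labels purely for illustration; any numeric vector would do.
y = np.array([3.0, 1.5, 4.0, 2.5, 5.0])

# The trivial model: always predict the same constant, namely the mean.
constant_prediction = y.mean()

# Its mean squared error equals the variance of the labels.
mse_trivial = np.mean((y - constant_prediction) ** 2)
print(mse_trivial, np.var(y))  # the two numbers are identical
```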
So all that's happening here is we're saying that when we expose the model to new data, which is what the validation and test sets are doing, we expect it won't perform as well as it did on the data that was used to train it. So we would generally expect those two curves, the validation and test error curves, to sit above the training error curve.

The second part of this theorem is that we can use the shape of that curve to identify what is meant by underfitting versus overfitting. Overfitting, which we covered previously, is what happens when we have a model that's too complex: it works really well on our training set but doesn't work well at all when we expose it to new data. That's what's happening on the left-hand side of this plot: we have a model with very low training error, or training MSE, but which performs poorly when exposed to new data in the validation and test sets. That's called overfitting. The other side of that curve, which we haven't covered so much, is what's called underfitting. In this case we're penalizing complexity a lot, which means we're fitting a very simple, not very complex model. It doesn't generalize well to new data because the model is too simple, but it doesn't work well on our training set either; we've penalized complexity so much that the model really isn't making very high-fidelity predictions. That's called underfitting.

Third, there should be a sweet spot somewhere between overfitting and underfitting. This is really the core of the model selection process: that sweet spot is the model we would ultimately select. Once we find the best value of lambda in terms of its performance on the validation set, that would be the bottom of this curve, and that would be the model we choose.

So those are our three theorems: the error should increase as lambda increases, the validation and test errors should be larger than the training error, and there should be a sweet spot between under- and overfitting, which is the model we ultimately select.

Finally, a few notes of warning about these theorems, to explain why I put the word theorems in inverted commas here. First off, due to randomness in real datasets, the theorems may not always hold precisely. You could have outliers in your datasets, or you could have very small datasets in the first place, and you can't necessarily guarantee that when you expose the model to new data these theorems will hold exactly; but they're good guidelines. If you have a reasonably large dataset, they should hold, and if they don't, it may be a sign that there's a bug in your code rather than a problem with the dataset. So checking whether these theorems hold, at least approximately, is a good way to sanity-check your training/validation/test pipeline. Really, these theorems should hold assuming you have large enough datasets and that your training, validation, and test sets are randomly sampled; if those assumptions are violated, the theorems may not hold, or may not hold as well. Finally, if we were maximizing accuracy rather than minimizing error, as we were doing in the plots I showed, the plots would just be flipped the other way up: when I talk about error going down, it's the same thing as accuracy going up, provided you've set up your regularization pipeline correctly.

So finally, this is basically what the validation pipeline would look like. In a later lecture, we'll actually implement this with some code.
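As an illustration of how you might sanity-check these theorems yourself, here is a minimal sketch. It assumes scikit-learn's Ridge as the regularized model and that numpy arrays X_train, y_train, X_valid, and y_valid already exist; those names and the choice of library are assumptions for the sake of the example, not something fixed by the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Assumed to already exist: X_train, y_train, X_valid, y_valid (numpy arrays).
lambdas = [10.0 ** k for k in range(-3, 4)]  # a sweep over powers of 10

train_errors, valid_errors = [], []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    valid_errors.append(mean_squared_error(y_valid, model.predict(X_valid)))

# "Theorem" 2: the validation error should (roughly) sit above the training error.
for lam, tr, va in zip(lambdas, train_errors, valid_errors):
    print(f"lambda={lam:g}  train MSE={tr:.3f}  valid MSE={va:.3f}  valid >= train: {va >= tr}")
```

Plotting valid_errors against lambdas should reproduce the U-shaped curve described above, with its minimum at the sweet spot between under- and overfitting.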
So we select some range of values for lambda; I usually use powers of 10, but I'll give guidelines on how to choose those values later on. For each value of lambda, we train one model on the training set, of course, which gives us a value of theta, and then we compute that model's performance on the validation set, which gives us the validation error. Next, we select whichever model had the lowest validation error, or the highest accuracy, and we compute its performance on the test set; there's a short sketch of this selection step below.

In this lecture, we introduced several theorems that characterize the relationship between training, validation, and test performance, and we showed how these concepts might be used within a training/validation/test pipeline. So, on your own, I would suggest trying to implement, if not quite the entire pipeline, then at least the computation of the training, validation, and test errors on one of our previous examples, for example the bankruptcy data, just to experimentally confirm that these theorems, as I've described them, really do hold in practice.
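To make that concrete, continuing the hypothetical sketch above (and additionally assuming held-out arrays X_test and y_test), the selection and test-evaluation step of the pipeline might look like this:

```python
# Pick the lambda with the lowest validation error, then (and only then)
# evaluate that single selected model on the test set.
best_index = int(np.argmin(valid_errors))
best_lambda = lambdas[best_index]

best_model = Ridge(alpha=best_lambda).fit(X_train, y_train)
test_error = mean_squared_error(y_test, best_model.predict(X_test))
print(f"selected lambda={best_lambda:g}, test MSE={test_error:.3f}")
```

The point of the design is that the test set is touched exactly once, after model selection is finished, so the test error remains an honest estimate of performance on new data.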