In previous videos we've seen how different kinds of learning algorithms can be used to train models that perform supervised learning tasks, such as classification and regression. Now let's take a look at how the performance of these models can be quantified so that we can identify the best learning algorithm to build them. After threatening dire consequences for not holding out test data properly, now we're going to talk about how you should use that test data. We'll introduce three error measures and show how each is calculated. We'll also discuss pros and cons of the different measures. By the end of this video you'll understand how error measures are used and be able to contrast the effects of different approaches. First, let's review the regression problem framework. Remember, regression is about finding a function h that uses some set of input features to predict some real-valued number. Regression learning algorithms find that function by fitting a model from the hypothesis space to the data set, choosing the model with the least penalty according to the training data. Once that model has been chosen, we step in. Now we use our test data so that we have a valid estimate of how well our model will perform on the operational data. How do we do this? Remember, our test data came from the labeled learning set, so we know what the correct answers are for those examples. You can think of it as having X_test, the feature matrix for these labeled test examples, and Y_test, or Y, the correct answers for each associated row. We test our regression model by feeding it X_test, and then we have its best estimates of the answers in the vector Y_hat. Remember, we call it Y_hat when it holds the predicted values rather than the stored labels. So whatever error measure we use, it must have something to do with the differences between Y and Y_hat, and as usual, we need a precise measure in order to report on the test error of our model.
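To make that setup concrete, here's a minimal sketch in Python. The model h, the features, and the labels are all hypothetical stand-ins for illustration, not anything fit by a real learning algorithm:

```python
# Hypothetical trained model h: maps one feature row to a real-valued prediction.
# (A stand-in for whatever the learning algorithm fit; here, a simple linear model.)
def h(x):
    return 2.0 * x[0] + 1.0

# Held-out test set: feature matrix X_test and the stored correct answers y_test.
X_test = [[1.0], [2.0], [3.0]]
y_test = [3.1, 4.9, 7.2]

# Feed the model the test features to get its predictions, y_hat.
y_hat = [h(x) for x in X_test]
print(y_hat)  # [3.0, 5.0, 7.0]
```

Any error measure we pick then compares y_hat against y_test, row by row.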
The formula we use should look familiar. We take the difference between the values predicted by the model and the stored values, Y_hat − Y, and just like before we have a choice about how to deal with those differences. Again, we don't care about the sign in this case; we only care about the magnitude of the difference. So our two options are to take the absolute values or to square the differences, then divide by the number of test examples so that we have an estimate of the average per-prediction error, or mean error. When we take the absolute value we're calculating mean absolute error, or MAE: mean because of the averaging, absolute because of the absolute value function. Easy. If we square the differences then we've got mean squared error, or MSE. If we want to convert the units back to their proper form we can take the square root of that total, and that's known as root mean squared error, or RMSE. We might want to report RMSE rather than MSE just so that we have some intuition about what that average error actually means, that the predictions are on average off by that amount, but it's unnecessary to do the square root operation if we're mostly making comparisons. In all cases, lower is better. A perfect model would have zero error, whether it's MSE, RMSE, or MAE. But remember, a perfect model on the test data is extremely suspicious, unless you have reason to think there's absolutely no noise in the phenomenon you're asking the question about, no noise in the data, and that the phenomenon is completely explainable within the hypothesis space your learning algorithm operates in. Perfection usually means you've somehow ended up training on the test data, or, okay, you're extremely lucky. After checking for data leakage between your training and test data, try holding out a different test set to see if the results are the same. All right, we have these different functions for measuring error. When do we use each?
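All three measures can be computed directly from the residuals. Here's a minimal sketch in pure Python; the example values are made up for illustration:

```python
import math

def mae(y, y_hat):
    # Mean absolute error: average magnitude of the residuals.
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

def mse(y, y_hat):
    # Mean squared error: average squared residual.
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def rmse(y, y_hat):
    # Root mean squared error: MSE converted back to the units of y.
    return math.sqrt(mse(y, y_hat))

y     = [3.0, 5.0, 2.0]   # stored test labels (made up)
y_hat = [2.5, 5.5, 2.0]   # model predictions (made up)

print(mae(y, y_hat))   # ≈ 0.333
print(mse(y, y_hat))   # ≈ 0.167
print(rmse(y, y_hat))  # ≈ 0.408
```

Note that RMSE is a monotonic function of MSE, which is why the square root is unnecessary when you're only ranking models.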
MSE is the most common error measure, and by now you probably have a sense of why: we do love those differentiable functions. There is one side effect of squaring the differences that we alluded to earlier: squaring makes big numbers even bigger. So while mean absolute error penalizes all mistakes in proportion to their size, mean squared error is increasingly sensitive as the errors get bigger. In other words, by MSE's estimation the model is better off making several small mistakes than one big one. The practical consequence is that MSE sacrifices a tight fit on many data points in order to avoid big mistakes on a few. In other words, one outlier in the test data makes our model look worse under MSE than it would under MAE. Sometimes this is what you want. If you have confidence that your outliers represent actual phenomena that need to be modeled, you should be emphasizing large mistakes. There's another factor we might want to consider in our assessment of our models. Remember, back in course one we talked about how different kinds of mistakes can matter more than others depending on your problem. Maybe there's some reason that underestimating values is not as bad as overestimating them, which can be especially true when the model will be used for classification, or maybe there's a certain range that's really important to get right. In that case you want a weighted error function. A weighted error function, as you might expect, lets you assign different weightings to different kinds of error, for example, penalizing overestimates twice as heavily as underestimates. This is more common in classification error measures, so we'll discuss it more in the next video. There are other loss functions. Mean squared log error adds a log function to the calculation of the difference. This results in underestimates being penalized more than overestimates and, more importantly, evens out the effect of mistakes, making the measure less sensitive to outliers.
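The outlier sensitivity of MSE versus MAE is easy to demonstrate. The two residual lists below (made-up numbers) have the same total absolute error, so MAE rates the two models identically, while MSE heavily penalizes the single large mistake:

```python
def mae(errors):
    # Mean absolute error over a list of residuals (y_hat - y).
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    # Mean squared error over a list of residuals.
    return sum(e ** 2 for e in errors) / len(errors)

# Two hypothetical models with the same total absolute error on four test points:
several_small = [1.0, 1.0, 1.0, 1.0]  # model A: four small mistakes
one_big       = [0.0, 0.0, 0.0, 4.0]  # model B: three perfect, one big miss

print(mae(several_small), mae(one_big))  # 1.0 1.0  -- MAE can't tell them apart
print(mse(several_small), mse(one_big))  # 1.0 4.0  -- MSE says model B is 4x worse
```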
It's useful when you care more about relative mistakes than about the absolute magnitude of your mistakes. Another measure is R^2. R^2 looks at the variation, or noisiness, of the labels. In particular, it uses the same squared error measure as MSE, but divides by the noisiness in the labels themselves. This means R^2 is normalized: the best possible score is 1, and a score of 0 means the model does no better than always predicting the mean of the labels. R^2 is usually used for interpretation without knowing the scale of the data, and it describes how much of the variation in the labels is explained by the variation in the predicted values. So now you've seen a variety of error functions for regression and have developed some insight into when each is appropriate. Next we're going to describe different error measures for classification. See you there.
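One common way to compute R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares of the labels around their mean. A small sketch with made-up numbers:

```python
def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    # Residual sum of squares: same squared differences MSE uses.
    ss_res = sum((t - p) ** 2 for t, p in zip(y, y_hat))
    # Total sum of squares: the noisiness of the labels themselves.
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

y     = [1.0, 2.0, 3.0, 4.0]   # stored labels (made up)
y_hat = [1.1, 1.9, 3.2, 3.8]   # close predictions (made up)

print(r_squared(y, y_hat))              # ≈ 0.98, close to a perfect fit
print(r_squared(y, [2.5] * 4))          # 0.0: always predicting the mean
```

Because of the normalization by ss_tot, the same R^2 score means the same thing whether the labels range over units or millions.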