In this video, we'll describe how to find a best fit line for a given set of data points. You've probably seen this done before, maybe even with a ruler on graph paper. In the context of machine learning, this is known as linear regression. We'll look at how linear regression works in two dimensions, and then show you how to use scikit-learn's built-in function for linear regression models. Before we begin, this video relies on some of the math introduced in the math review reading, including lines and derivatives. If those are unfamiliar to you, you might want to go back and revisit that reading to help you understand this video. Throughout the course we've been using the concept of hypothesis spaces. Remember that a hypothesis space is a collection of hypotheses that might answer a particular question, and learning algorithms find the best hypothesis in the space of possibilities, which we'll call the model. We talked about how classification learning algorithms operate in the space of functions that classify examples. However, we can also think about the set of hypotheses that predict numbers rather than categories. When that's the case, you're in the realm of regression. The space of all arbitrary functions that return numbers is a pretty big space. One of the simplest subsets of this hypothesis space is the set of linear functions: straight lines. In this case, we're doing linear regression. For two-dimensional data points, this means finding the straight line that fits the data we have as well as possible: the best fit line. As an example, this set of five data points represents the number of ice cream sales at a given outside temperature. Let's say we've run linear regression to find our best fit line. We can represent this particular line as the function 10 times x plus 10. For any given temperature, we multiply by 10 and add 10, and that gives us an estimate of how many ice cream sales we can expect.
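The line described above can be written as a tiny prediction function. Here's a minimal Python sketch of that rule (the function name is just for illustration):

```python
def predict_sales(temperature):
    """Best fit line from the example: h(x) = 10x + 10."""
    return 10 * temperature + 10

predict_sales(5)   # 60: at 5 degrees Celsius, we expect about 60 sales
```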
When we have this line, we can forget about all of our training examples and use the line to predict the number of ice cream sales for a particular temperature. This technique is known as linear regression. One benefit of lines is that they're really simple: we can understand and control them more easily than other, more complicated models. But even so, why did I draw this line instead of that one, or this one, or even this other line? Even though lines have a simple structure, we still need some way to pick the best line. How can we do this? As humans we can do this fairly well in two dimensions, but a computer doesn't have your understanding of visual space, so we'll have to somehow tell the computer how to choose a particular line out of all the possible lines. Again, this amounts to narrowing down our hypothesis space to the one hypothesis, or model, that best makes predictions on unseen data. To do this, we'll develop a mathematically precise definition of what we mean by best, and hope that the best predictor on our example data will be the best on new examples as well. A good way to find the best fit line is to measure each line's relative badness instead of trying to figure out its relative goodness. We do that by penalizing bad lines and then having our computer choose the line with the least penalty associated with it. This penalty will be some measurement based on mistakes: the difference between the value predicted by the line and the value given in the training data. Functions that quantify mistakes are known as loss functions. Now we can define a good line as the line that minimizes loss as much as possible. We're looking for the line that makes the fewest and smallest mistakes. Let's formalize this idea with some math. We'll keep with the example where our ice cream sales model, the best fit line over temperature, is given by 10x plus 10. X is the temperature and the label Y is the number of ice cream sales.
First, for each labeled example in the training data, take the difference between the y-value your model predicts for that x and the actual y-value for that example. Our first data point is (5, 75): for a temperature of five degrees Celsius we made 75 sales. Our model predicted 60 sales but the recorded actual value was 75, so the difference between the two is minus 15. Similarly, for our second data point the predicted value was 110 but the actual value was 50, so the difference between the two is 60. We're going to sum up all these differences, but notice that they might be positive and they could also be negative. We don't really care whether we were predicting too few sales or too many; we just want to get it right. We certainly don't want our underestimates canceling out our overestimates. In math terms, we want to capture the magnitude of how far our line is from the training data. So we're going to square each difference to ensure that our measurement is positive. You might be wondering why we chose to square the difference between the predicted value and the label instead of just taking the absolute value, and I promise we'll get back to this in a later video. For now, just go ahead and use the squared difference for each of our data points and then sum them all up to get the least squares error for the entire line. Instead of writing out all the sums every time, we'll use sigma notation to describe the same thing. So here's the notation in its general form: take the sum of the loss, which is h of x minus y, squared, for every (x, y) example in your training set. This is a very common penalty function known as the L2 loss function, or least squares error. The L2 loss function, like any other loss function, measures how far off our line is from all of the examples in the training data. So back to our example: the loss of our 10x plus 10 line is 3,150. Of course, we can calculate the penalty the same way for any line we want.
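The sum described above can be sketched directly in Python. Note that only two of the five training points, (5, 75) and (10, 50), are stated in this video, so the list below is incomplete and the resulting loss differs from the 3,150 quoted for the full data set:

```python
def l2_loss(h, examples):
    """L2 loss: sum of (h(x) - y)^2 over all (x, y) training examples."""
    return sum((h(x) - y) ** 2 for x, y in examples)

h = lambda x: 10 * x + 10          # our best fit line
examples = [(5, 75), (10, 50)]     # the two points given in the video
l2_loss(h, examples)               # (-15)**2 + 60**2 = 3825
```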
Over 3,000 might sound like a lot for what we call the best fit line, but it's all relative. Let's compare it to the errors of the other three lines we saw earlier: 9,000-some, 8,000, and 13,850. We can see that our line has the smallest loss of any of those lines. So given any set of lines, we can compare the penalty for each of them and then pick the line with the smallest loss. Great: now when I give you a set of lines, you know how to determine which one is best from that set. But we're not done yet. In the next video, we'll discuss how we can find the best line without having to calculate the loss for every possible line.
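Picking the smallest-loss line from a set of candidates can then be sketched in a few lines of Python. The alternative lines below are invented for illustration (they are not the ones drawn in the video), and the loss is computed over just the two data points the video states:

```python
def l2_loss(h, examples):
    """L2 loss: sum of squared prediction errors."""
    return sum((h(x) - y) ** 2 for x, y in examples)

examples = [(5, 75), (10, 50)]     # the two points given in the video

# Hypothetical candidate lines (slopes and intercepts made up for this sketch).
candidates = {
    "10x + 10": lambda x: 10 * x + 10,
    "x + 100":  lambda x: x + 100,
    "15x + 15": lambda x: 15 * x + 15,
}

losses = {name: l2_loss(h, examples) for name, h in candidates.items()}
best = min(losses, key=losses.get)  # the candidate with the smallest penalty
```

This is exactly the comparison described above, but note it only chooses among the lines we happen to list; the next video covers finding the minimum over all possible lines.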