In the last module, I spoke about machine learning problem types, such as supervised and unsupervised learning, and covered key components within supervised machine learning, such as standard algorithms, data, predictive insights, and repeat decisions at scale. Now that you know the individual building blocks, we're going to shift back to the phases of an ML project. In this module, I'm going to focus on key considerations from preparing your data to deploying your trained ML model.

In the first topic, we'll cover features and labels in depth as part of the data collection and preparation phase. Next, I'll teach you some of the ways to obtain or build labeled datasets. Then I'll go over the considerations for training an ML model using your data, and of course, you can't use an ML model without first evaluating it for accuracy, so we'll cover that too. I'll close the module by offering you a few best practices, and I'll introduce your first hands-on lab. Let's jump in.

In Module 2, I first introduced these high-level phases for a machine learning project: assessing the ML problem, collecting and preparing data, training an ML model using pre-selected metrics and objectives, evaluating and validating the model, and finally deploying the model. In Module 3, I talked a lot about the importance of collecting volumes of high-quality data, which affects Phases 2, 3, and 4 of an ML project.

Now, suppose you've decided upon an ML problem and collected your data, or realized you already had it. You're ready to train an ML model. But before you do, let's talk about what's involved in preparing the data. For classification and regression ML problems, you already know that to train an ML model, it needs to learn from lots of examples, or labeled datasets. In fact, the more examples, the better the ML model learns. An example, or the input data, has three parts: the features of the example, the resulting label or classification, and the label type. Let's look at each in turn.
The features are brief descriptions that give context or meaning to a piece of data. In this case, the features of a leaf are yellow, small, spotty, and so on. We're used to features in the context of products, for example, a new camera feature on your phone. But in this context, a feature simply means a distinctive attribute. Features are then used to identify the resulting label or classification. Going back to our example, if the features of a leaf are that it's spotty or yellow, then the resulting label or classification would be ill. If the features are full shape and consistent color, then the resulting label is healthy.

Finally, label types can be numbers, categories, or even phrases. For example, $10,000 is an amount of money that someone deposits in a year, which is a number. Whether that amount is high or low is a category. And in the communication between the bank clerk and the customer, a phrase in the email might be auto-populated, for instance, "Let me know if you have any questions."

Let's put these terms into practice with an example. Suppose you wanted to use machine learning to predict the price of a house. What is the label in this scenario? If you said price is the label, you are correct. Remember, the label is what we're trying to predict. Next, what is the label type? Price is the hint, so it's a numeric label type. What are some relevant features? Here are a few: location, number of rooms, square footage, home style (such as bungalow, detached, or semi-detached), the school district, and whether or not there's a basement. Here's a bonus question: what ML problem type is this, classification or regression? You guessed it. This is a regression problem because the label is a numeric value.

Now, even though labels are necessary, choosing the right one isn't always obvious, and not surprisingly, choosing the wrong one can negatively affect your model. Let's use a banking example.
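The house-price walkthrough can be sketched in code. This is a minimal illustration, not part of the course material: the feature values, prices, and the simple least-squares fit are all made up to show how numeric labels make this a regression problem.

```python
# Toy regression sketch for the house-price example.
# All numbers here are invented for illustration.

# Each training example pairs features (square footage, number of rooms)
# with a numeric label (the price we want to predict).
examples = [
    ((1200, 3), 250_000),
    ((1500, 3), 300_000),
    ((2000, 4), 400_000),
    ((2400, 4), 480_000),
]

# Fit price ~ a * sqft + b by ordinary least squares on one feature.
sqft = [features[0] for features, _ in examples]
prices = [label for _, label in examples]
n = len(examples)
mean_x = sum(sqft) / n
mean_y = sum(prices) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, prices)) \
    / sum((x - mean_x) ** 2 for x in sqft)
b = mean_y - a * mean_x

# Predict the numeric label for an unseen 1,800 sq ft house.
predicted_price = a * 1800 + b
```

Because the label is a number rather than a category, a regression fit like this one is the appropriate tool; a classification example would instead map features to category labels such as "ill" or "healthy".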
A bank might be interested in predicting how much money a current customer will deposit over the next 10 years. That means we need data on customer deposits going back at least 10 years, but this poses a lot of problems. Did people earn the same amounts over the past 10 years? Was there a change in the customer's income? If the customer's current income increased or decreased compared to their average income over the past 10 years, the model will likely make wrong predictions. Was there a recession? If so, then that period might not be a good predictor of the immediate future. Did our bank open new branches? If so, customers at a new branch might be different from those at an old branch. All of this makes it very difficult to create the right dataset for this problem, and the next 10 years may be very different from the previous 10 years.

Just how difficult? Well, to create the labeled dataset, you need specific features. For example: How much money was deposited in the last 10 years? What was the economy like in each of those years? How many customers did we have in each year and at each branch? Very quickly, you notice that you need data on several other factors to then set labels for the next 10 years of deposits. This is still very difficult to predict with accuracy.

Understanding that ML depends on examples, and that you need labeled examples, will help you formulate the problem in a way that allows it to be solved. After all, you don't want to promise your boss that you will create a 10-year model and then find out the hard way that it doesn't work as well as you'd imagined. I will return to this subject of how to take a business need and create a well-formed ML problem in an upcoming video. Let's continue with building labeled datasets in the next video.
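To make the point concrete, here is a hypothetical sketch of what a single labeled example for the deposit problem might look like. Every field name and value is invented; the course does not define a schema, and assembling real values for fields like these is exactly the hard part described above.

```python
# Hypothetical training example for the 10-year deposit problem.
# All field names and values are illustrative, not real data.
deposit_example = {
    "features": {
        "avg_annual_deposit_past_10y": 10_000,  # dollars deposited per year
        "income_change_past_10y": 0.15,         # relative change in income
        "recession_years_in_window": 1,         # economic context
        "branch_customer_count": 1_200,         # old vs. new branch context
    },
    "label": 12_500,         # avg annual deposit over the *next* 10 years
    "label_type": "number",  # numeric label, so this is regression
}
```

Even writing out one such row shows the difficulty: the label itself is 10 years of future behavior, and each feature requires its own historical data collection.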