Let's now talk about how to improve data quality. The idea is that we will be working with one dataset that has some dirty data, and we will see how we can make this data more useful. Among the things we will do: we will see how to resolve missing values, we will see how to convert a date feature column to a datetime format, and we will create one-hot encodings of categorical features.

So let's run this lab. This lab is again one of the labs available for you under the public GitHub repo called training-data-analyst, under the GoogleCloudPlatform organization. Remember that we already created a notebook, so if we go under Vertex AI Workbench, here we have our notebook, and we already cloned this GitHub repo into it. Remember the path: once inside the cloned repo, we need to go under training-data-analyst/courses/machine_learning/deepdive2, and this improve data quality lab is under the folder launching_into_ml. So let's go to the solutions and open the improve data quality notebook. As always, I will restart the kernel and clear all the outputs to start from scratch.

In this case, the dataset we will be using comes from the California Open Data portal and contains information about vehicle fuel type counts by zip code. Here you have the information about this public dataset.

First of all, we need to import the libraries that we will use during the lab: TensorFlow; pandas, to work with a DataFrame; NumPy, to do mathematical computations; and Matplotlib and Seaborn, to create different graphs.

The next step is to upload the CSV file. As I said, this CSV file is available on that website; you can download it and upload it yourself. We have already downloaded this dataset and uploaded it into a Google Cloud Storage bucket, so we will copy the CSV file from that bucket into our notebook. To do that, first we create a new directory with the os library, called data/transport. Once we run this cell, we create a folder data with transport inside it, and for now it's empty. Then we move the CSV file that we already have in this Google Cloud Storage bucket into that folder. You can see how we now have this sample of the dataset here to work with, and if I want to check it programmatically, I can list the files that exist under this directory: there is only one file. Okay, so we are good to continue with our lab.

The first thing we will do is read this dataset: we will load the CSV file into a pandas DataFrame. That is very easy with the function read_csv from pandas; we specify the path where we have our CSV file, and the result is a DataFrame that we call df_transport. With the function head, we see the first five lines of this DataFrame. By default, when we pass no argument to head, pandas shows us five rows, but we can change the number of lines we want to see. Each of these rows corresponds to one example in this dataset, so here we can specify exactly the number of examples that we want to visualize.
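As a minimal sketch of what these first cells look like (the Cloud Storage URI and the CSV file name below are placeholders, not the lab's exact values):

```python
import os

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

# Create the local folders data/transport for the dataset.
if not os.path.isdir('../data/transport'):
    os.makedirs('../data/transport')

# Copy the CSV from a Cloud Storage bucket (hypothetical URI shown).
# !gsutil cp gs://your-training-bucket/untidy_vehicle_data.csv ../data/transport/

# Load the CSV into a pandas DataFrame and look at the first rows.
df_transport = pd.read_csv('../data/transport/untidy_vehicle_data.csv')
df_transport.head()  # e.g. head(10) would show the first ten examples instead
```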
First of all, before we start to clean our data, we need to understand and analyze it, and that is what we will do with these next functions. We want to see which features we have, the data type of each feature, and, something super important, whether we have any nulls. So we run the function info over this DataFrame, and we can see here the features that we have and that no nulls are reported. We can also see the different data types; when you see object here, it is because strings appear in pandas as object. One thing to mention: date here is reported as object, meaning it is stored as a string. It is very common to want to change this to a datetime data type instead of working with the date as a string. Here we also print the first and last five rows of each column, and you can see that we have 499 examples.

Something that is super important, as I said, is to understand your data. With the function describe of the pandas DataFrame, we get summary statistics for all the numeric features (we can have numeric and categorical features, and describe gives us the statistics of the numerical ones): the mean, the standard deviation, the quartiles. So we can look at and analyze this data. Now we want to investigate our data further: we group our data by the different kinds of fuel that exist for the cars, and we show only the first entry per month in our DataFrame. We are doing an analysis here; we are trying to understand our data.

The next step is to clean our data. Something that is always super important is to check whether we have missing values. To do that, we can use the function isnull and sum all the null values that pandas identifies for each of the features. We can see that we have some null values that we need to resolve. What I am doing here is taking the column date from the complete DataFrame; you will see that when a value is missing, we get a NaN. With this other line, pandas gives us the result of isnull over this feature date: when a value is not null we get False, and when it is null we get True, so the NaN corresponds to the True here. We can check the same for the other features. In this case, we will need to do some work to resolve these null values.

If we want to summarize our data, meaning how many examples we have, how many features, the unique values per feature, and the missing values, we use these functions that are super common with a pandas DataFrame. With shape, the first element gives us the number of rows and the second the number of columns: we have 499 rows and seven columns. With columns converted to a list, the result is a list that contains the names of all the columns present in our dataset, so we see the list of all the different features and the label. And it is very important to know the number of unique values that exist for each of the different features.
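A rough sketch of this exploration, assuming the DataFrame is named df_transport as above and the date column is called 'Date' (the real column names in the notebook may differ):

```python
# Column dtypes and non-null counts; 'Date' shows up as object (string).
df_transport.info()

# Summary statistics for the numeric features only.
print(df_transport.describe())

# Missing values per feature, and a boolean mask for one feature:
# True where the date is missing (NaN), False otherwise.
print(df_transport.isnull().sum())
print(df_transport['Date'].isnull())

# Quick dataset summary: shape, column names, unique values per feature.
print('rows, columns:', df_transport.shape)
print('features:', df_transport.columns.tolist())
print(df_transport.nunique())
```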
Why? Imagine that, for example, one of the features is day, and you have 125 different values. This is an error, because the number of days must be between 1 and 31. This is why it is so good to first try to understand, as I said, your data. Here we can see the different values that exist for each of the features, and the total number of missing values: we have 17 missing values that we want to resolve. In the same way that we have the function head to show the first examples, we have the function tail to show the last examples. By default, if we don't specify any argument, it shows us the last five, but you can put another number here, say, to show the last 10 examples of this dataset. We can see that there are different issues we want to fix.

The first thing we want to do is fix the missing values, because, as we know, we have 17 of them. How can we fix missing values? There are different ways. Sometimes you want to eliminate the rows in which you have a missing value, but in that case you are losing examples. So what we do here instead is impute these missing values with the most common value present in the rest of the examples. And how can we do that? We use a lambda function, super easy. When we apply this lambda function, the idea is that, for each column, if a value is null, we impute the most common value for that column across the complete dataset. Once we define this function and apply it over the DataFrame, if we now check whether we have any nulls, we can see that the result is 0. So the first task is solved.

Let's see the second task. Remember that when we looked at the data types with the info function, date was object; that means pandas identified date as a string. We want pandas to identify date as a datetime data type. So what we do here is take the date column and convert it to a datetime data type, and we specify the format that we want to use. If we now run the info function again, we can see that it is not an object anymore: now it is datetime. And because it is datetime, we can use the different date functions that we have available in pandas, like, for example, the functions to extract the year, the month, and the day. Instead of having one single feature with the complete date, we want to split this feature into three different features: one for the year, one for the month, and one for the day. So look: at the beginning we had seven columns; now we have 10 columns, because we kept date and added year, month, and day. One thing that is very common, once we have split the date into new columns for year, month, and day, is to drop the original date column.

Again, we are trying to understand our data, we want to get more insight into it. So now we use the function groupby again, to see the first example available each month for the different makes of car. Good, and let's do some plotting, some graphs to visualize our data. In this case, we put the months on one axis and the number of vehicles on the other, and we can see that we don't have data across all the months, only across a few of them, okay? Another thing that is important is to have consistency in the names of the columns.
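A hedged sketch of these cleaning steps; the 'Date' column name and the date format string are assumptions, and the fillna-with-the-mode lambda is the standard pandas pattern for what the transcript describes:

```python
# Impute missing values in every column with that column's most common value.
# value_counts() sorts by frequency, so index[0] is the mode of the column.
df_transport = df_transport.apply(
    lambda col: col.fillna(col.value_counts().index[0]))

# Verify that no nulls remain.
print(df_transport.isnull().sum().sum())  # expected: 0

# Convert the date column from string (object) to datetime
# (format string assumed; adjust to the actual data).
df_transport['Date'] = pd.to_datetime(df_transport['Date'], format='%m/%d/%Y')

# Split the date into three separate features.
df_transport['year'] = df_transport['Date'].dt.year
df_transport['month'] = df_transport['Date'].dt.month
df_transport['day'] = df_transport['Date'].dt.day

# Once split, it is common to drop the original date column.
df_transport.drop(columns=['Date'], inplace=True)
```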
So don't mix upper case and lower case, and don't use spaces. That is what we are doing here: renaming all the columns so that everything is in lower case, using the function rename. Another thing that we want to do, if we look at this dataset... let's see if I am able to find one example of the issue we want to fix. When I call head, you can see that for the model year we sometimes have the value <2006. We cannot mix this kind of symbol with the numbers, okay? We want consistency in the data type that we use, and in this case we don't want to keep these examples. Another way to handle it would be to transform this column: if we treated modelyear as a categorical column, we could have an out-of-vocabulary list, and all of these values would fall under the out-of-vocabulary bucket. But in this case we want to eliminate these rows, and that is exactly what we do here. We create another DataFrame that contains all the data from the original DataFrame, but we eliminate all the rows with this model year, because we only want to take the examples where modelyear is different from this value. So here you can see that we eliminated the <2006 rows.

Good, last part that we will do here. When we are working with a machine learning model, whether it is a neural network or a linear regression model, you need to remember that it performs a lot of mathematical computation, a lot of mathematical operations. So the model expects numbers as inputs, and very often we have features that are categorical: yes or no, female or male. In this case, for example, look at the type of fuel. So what happens? We need to transform these categorical features into a form the model is able to understand, and it is super common to use one-hot encoding. That is exactly what we will do here. For example, lightduty has two possible values, yes and no, and we will one-hot encode them into ones and zeros: zero corresponds to no, one corresponds to yes. So if we now look at the head, you can see how lightduty has been transformed; remember that before, lightduty contained yes and no. Here I am using a lambda function to do this one-hot encoding, and next we will do it for all the categorical variables. For that we can use a super useful function that exists in pandas called get_dummies. When we use this get_dummies function, it converts a categorical variable into one-hot-encoded indicator variables. So let's do exactly that, and look what happens: the system automatically creates a lot of columns. Why? Because if, for example, we have in, I don't know, modelyear four possible years, it creates one separate column for each year. Each example then contains a one in the column that corresponds to its year and zeros in the rest of the columns. This is why, when we work with one-hot encoding, the number of columns in our DataFrame increases a lot. You can see here that after transforming all of these categorical features into one-hot encoding, we are working with 49 columns, while at the beginning we only had seven, okay?
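A rough sketch of these steps, under stated assumptions: the column names (modelyear, lightduty, zipcode, fuel, make) and the exact '<2006' filter value follow the transcript, not necessarily the notebook's exact code:

```python
# Normalize column names: all lower case, no spaces.
df_transport.rename(columns=lambda name: name.lower().replace(' ', ''),
                    inplace=True)

# Keep only the rows whose model year is a real number, dropping '<2006'.
df = df_transport.loc[df_transport.modelyear != '<2006'].copy()

# One-hot encode a yes/no column with a lambda: No -> 0, Yes -> 1.
df['lightduty'] = df['lightduty'].apply(lambda v: 0 if v == 'No' else 1)

# One-hot encode the remaining categorical features with get_dummies;
# each distinct value becomes its own 0/1 indicator column.
data_dummy = pd.get_dummies(df[['zipcode', 'modelyear', 'fuel', 'make']])
```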
And if we examine the result, we have created this DataFrame that we call data_dummy, which contains these different features one-hot encoded. For example, for every year that exists in our dataset, the system creates a specific column, and based on the original example, if the year is, say, 2010, the example has a one in the column that corresponds to 2010 and zeros in the rest of the columns. So we have created a DataFrame with 49 columns that contains the one-hot encodings of all these categorical features. Now we want to concatenate our original DataFrame with this data_dummy DataFrame that we just created. Remember, we started with only seven columns; at this point we end up with 59 columns. But because we now want to work with the one-hot encoded values for the categorical features, we need to eliminate the original categorical features, using the drop function on this DataFrame. So this is the final dataset that we will use to train our model: a dataset with 53 columns, containing all the numerical features and all the categorical features expressed using one-hot encoding. So now we have finished our data cleaning and data preparation. It is super common that the biggest part of our time is spent on this part of the job. Now let's move on to the next step, which will be to start creating super cool models. See you later.
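A minimal sketch of this final assembly, under the same naming assumptions as the previous snippet:

```python
# Concatenate the original DataFrame with the one-hot indicator columns.
df_combined = pd.concat([df, data_dummy], axis=1)

# Drop the original categorical columns; their information now lives
# entirely in the one-hot indicator columns.
df_final = df_combined.drop(columns=['zipcode', 'modelyear', 'fuel', 'make'])

print(df_final.shape)  # all-numeric features, ready for model training
```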