One of the most critical parts of data analysis happens well before you build your first visualization or your machine learning models. While not necessarily the most glamorous step, data preparation and transformation can't be ignored. In this module, we're going to cover the elements of what makes a dataset good, and then look at two different ways to process, clean, and prepare your data. One, as you might expect, is through SQL in BigQuery, and the other is through a graphical user interface in Cloud Dataprep. Most data quality lectures begin with a statement like this: garbage in, garbage out. What that means for us is that if you want to build amazing machine learning models, you can't feed them garbage. You've got to feed them really good data to begin with. If that is your end goal, building great ML models after you've mastered this course and the Data Engineering course that follows, then this is your real first step. You have to get extremely clean data before something like an ML model, or even just insights from SQL, can operate at that caliber, because ML in particular relies on consistent patterns in data. If you start feeding it garbage data, it can overfit your models and cause a whole bunch of other problems you don't want. To prevent that, let's talk about what we can do. High-quality datasets follow five strict rules. They're valid: they conform to business rules. They're accurate: they match a true, objective source. They're complete: you can see the whole picture rather than just a subset of the data, so you're not being fooled. They're consistent: you can derive insights you can actually rely on. And they're uniform. There's a fun example we'll go through for each of these. First up, validity: valid data follows constraints on uniqueness.
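That uniqueness constraint is easy to express as a check in code. Here's a minimal sketch in Python of a duplicate-key detector; the field names, like `ein`, are hypothetical, echoing the charitable-filings example from earlier in the course:

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return values of `key` that appear more than once -- these are join hazards."""
    counts = Counter(row[key] for row in rows)
    return [k for k, n in counts.items() if n > 1]

# Hypothetical filings keyed by EIN. EIN 111 appears twice in the same year,
# so any join on EIN would fan out and double-count that organization.
filings = [{"ein": 111, "year": 2015},
           {"ein": 222, "year": 2015},
           {"ein": 111, "year": 2015}]

print(duplicate_keys(filings, "ein"))  # [111]
```

The same idea in SQL is a `GROUP BY key HAVING COUNT(*) > 1` query, which you'll see variants of when we work in BigQuery.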
What do these three images have in common? They're all unique identifiers. The telephone represents a unique number where you can reach me and I can reach you. If somebody else had my same number, I'd be worried. I don't even know how that would work; you would call and two people would answer. It just wouldn't work. Having multiple physical addresses be the same, or people sharing the same license plate, sounds equally ludicrous, and the same principle presents a massive problem in your data when we talk later in the course about joining data together, or, as we saw earlier, when a charitable organization's EIN has multiple filings within the same year. That presents the complications we talked about: when you're summing things and assuming every record is an individual one, non-unique records can cause serious trouble, especially when we get to joins later and you unintentionally cross join your data. Second, valid data conforms to range constraints. Here we go: which values are out of range? Say you've got some dice and you've got the rolls here with their occurrences. If you're staring at roll number 7, that could potentially be right. But say you're rolling one die: is it possible to get a seven? No, it's not. The seven is out of bounds. Alternatively, since I did mention dice, say you're rolling two dice. What would be the out-of-range value then? That would be roll number 5, with a value of one. If you roll two dice, you can't have a total of one. The meta-point here is that if you know ahead of time when your data smells funny, or if after the exploratory analysis you performed you find that, hey, this value should always be x or should always fall in a certain range, you can set those checks up as tests, much like they do in software development.
You can do that within SQL, and you can do it within one of the other tools I'm going to show you, Cloud Dataprep. For any of these values that make your dataset look weird or give you pause, you can set up automatic notifications that basically say: hey, there's dirty data right here, go ahead and clean it up and transform it. In this particular case, you could say: if the value is one, I want to exclude that roll entirely, or maybe I want to invalidate the whole result set, because there's something obviously wrong with the dice. There are many things you can do, but setting up that validation in an automated and repeatable way is one of the end goals here. Moving on: data has to match a source of truth. That's what makes it accurate. If you're in the United States and you're presented with this list, Washington, Oregon, California, hot dog, Florida, and Maine, you think, wait a minute, I don't remember a state called Hot Dog. It doesn't match what we know as a source of truth. It's a string, so it matches in terms of data type, but against a known lookup source it doesn't match. Again, there are debates, as you're going to see in your Cloud Dataprep lab: how many US states are there? Are there 50, as is commonly known, or are you also including US territories, or post offices that exist in other countries? You'll see that come up again soon. What about this? Let's talk about completeness. Thanks in large part to the ubiquity of the image on the right, you'll be able to discern that A is actually Big Ben. But here you're looking at a subset, a sample of your data, and trying to draw insights or assumptions about the entirety of your data. From just those cropped views, you could say: my dataset is a bunch of clocks and lamps; they look funny and they're low resolution.
But in reality, the greater picture of your dataset, because pixels are just numbers, is much more complex. This is a location; this is Big Ben; this is London. It's funny how we can immediately notice something wrong with a picture. A picture is worth a thousand words. But with a dataset, if someone hands you a CSV file or a bunch of JSON, then unless we're practiced at it, we generally don't ask: is this all the data? Is this a subset? Am I missing fields or rows or columns? Is this complete, or do I need to do additional work to make sure I'm collecting everything I need to collect? That's especially true when you're mashing up data from a variety of different silos. Don't be afraid to ask those questions: Is my data complete? Do I have everything? Are all the rows accounted for? Last but not least, we have consistency. Here's the dilemma: you potentially have two owners for one house. You have two different tables: a houses table with the address 123 ABC Street and an owner ID of 12, and an owners table where owner ID 15 also has the address 123 ABC Street. In Databases 101 we call this a referential integrity issue. Who actually owns that house? Whose name is on the deed? With consistent data, we call this one fact in one place, you get harmony across many different systems, especially when it comes to joining this data later, and you don't run into these issues. I lied, though: consistency wasn't the last one. Uniformity is the last one, because this is the example I really like to share. Just before the year 2000, in 1999, NASA unfortunately lost the 125-million-dollar Mars Climate Orbiter as it burned up in the Martian atmosphere, because one team was using the English system of measurement while another team working on a different part of the system was using the metric system.
While it may seem like an innocuous problem at first, just English versus metric, who's right, at the end of the day, when you're building a system as complex as this, you need to make sure that everyone is using a uniform system of measurement.
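As a small illustration of the uniformity rule, here's a sketch in Python. The reported root cause of the Orbiter loss was one component producing thrust impulse in pound-force seconds while another expected newton-seconds; an explicit, tested conversion at the boundary is what keeps a dataset uniform. The function and variable names here are hypothetical:

```python
# One pound-force second in newton-seconds (exact by definition:
# 0.45359237 kg/lb * 9.80665 m/s^2 standard gravity).
LBF_S_TO_N_S = 4.4482216152605

def to_newton_seconds(impulse_lbf_s: float) -> float:
    """Normalize an impulse reported in pound-force seconds to SI newton-seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S

raw = 10.0                     # lbf*s, as one team might report it
si = to_newton_seconds(raw)    # ~44.48 N*s, as the other team expects

# An unconverted value is silently off by a factor of ~4.45 -- exactly the
# kind of quiet uniformity error that doomed the Mars Climate Orbiter.
assert abs(si / raw - LBF_S_TO_N_S) < 1e-9
```

The broader lesson for your datasets is the same: pick one unit, one date format, one encoding per column, and convert everything to it at ingestion time rather than trusting each source to agree.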