Now ahead is one of the most critical modules to pay attention to, even if you're a SQL guru. Here we're going to look back at the evolution of modern databases and end with how the technologies behind BigQuery address some of the architectural limitations that prevented true petabyte-scale data analysis. We'll then discuss a core database concept called normalization and end with some pretty cool data structures, like having nested records inside a table. Let's start with the database evolution journey first. Let's dive right in.

For a lot of these concepts, we're going to step away from the SQL syntax that we've learned before and think more about database theory and the evolution of databases over time. The example I'm going to walk through is the IRS data example that we've been using throughout this course. Let's take a look at this schema and the way it actually stores these expenses. We have over 20 expense types recorded for each of these charities. A charity can have legal expenses on the form they're filing for 2015, accounting expenses, lobbying expenses, royalties, travel expenses, grants. All of these data fields need to be stored in some structured row and column format.

Let's take a look at how we've actually done it, "we" being the IRS and us ingesting that data into BigQuery. That form data is actually pivoted so that instead of being a list of expenses, it's an increasingly wide table of all these different fields. If you query the IRS 990 2015 table, this is exactly what you would see: for each record, you have one EIN and a whole list of all the fields that were filed on that 990 return, with all the different field names. Some examples for the expenses: legal fees, accounting fees, fees for lobbying, office expenses.

Now, here's the problem with structuring the data this way. What if I told you that for 2016, there are 10 new types of legal expenses I want you to add to IRS Form 990 for charities to fill out? What would that mean for your data architects? If the PDF form that charities have to fill out gets 10 more fields, no problem for them. But you get that new form data in, and what about all the old data that you have? What about the schema that you already have? It forces fundamental changes. You can technically go wider and create new columns, but think of the downsides. Anytime you're doing unions against past years, you're going to have tables with mismatched columns. And that's the least of your worries. The biggest question is: when does it end? Maybe some years lose expense columns, so you're shortening the schema, and then some years have many more columns, so you're widening it. Constant changes to a schema are bad design. We need to find a better way to cope with a flexible number of fields to store. The answer is potentially not having them stored as fields at all.

As we covered, growing the schema like that isn't really scalable in either direction. You don't want to make continuous changes to your schema over time; you want to be flexible. So let's talk about one of the potential solutions from the traditional relational database world. What you can do is break your expenses out into a completely separate table. Instead of storing everything in one massive table, what I call the mother of all tables, you can begin to break those fields out as rows in another table.
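To make that concrete, here's a minimal sketch of the two approaches in BigQuery-style DDL. The table and column names are simplified stand-ins for illustration, not the actual IRS 990 schema:

```sql
-- The "wide" approach: one column per expense type (hypothetical names).
-- Adding new expense types in 2016 means altering this schema.
CREATE TABLE irs_990_2015 (
  ein             STRING,   -- charity tax ID
  legal_fees      NUMERIC,
  accounting_fees NUMERIC,
  lobbying_fees   NUMERIC,
  office_expenses NUMERIC
  -- ...20+ more expense columns, and growing every year
);

-- The "narrow" alternative: one row per expense.
-- New expense types in 2016 arrive as new rows, with no schema change.
CREATE TABLE expenses (
  ein          STRING,
  tax_year     INT64,
  expense_type STRING,
  amount       NUMERIC
);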
Tables are much more flexible vertically, by adding and appending new rows, than they are horizontally, by continuously adding and removing fields. That's no fun. Let's cover this example. Here we've broken it apart into three separate tables instead of one massive table that holds all the different field values. You have the organizational details in one table, and that's the item in blue. Then in green, you have the historical transactions, and you can look up those expense codes from your expense code lookup table. That table basically says expense code 1 is lobbying, 2 is legal fees, 3 is insurance, and maybe code 50 could be travel, or something like that. You can have as many items in this lookup table as you want, and the historical transactions can just look up against it.

You might be wondering why we bother with an expense code at all instead of just writing the expense name in the transactions table. This gets to another point of database theory, which is: try to have one fact in one place. Say you end up renaming lobbying to "lobbying services." If you had the word lobbying repeated tens of thousands of times in the transactions table because you didn't break it out into a lookup table, that's ripe for error as well. Generally, you just want as much separation as you can get. But again, we'll talk about when and how far is too far.

This entire process we've covered is called normalization. Normalization comes from the relational database world; this is apart from BigQuery and apart from any specific technology. It's just database theory: breaking apart your massive, centralized, consolidated data tables into separate but related tables. This is the foundation of the relational database model. What happens if we go too far? We're going to talk about that in normalization versus denormalization: breaking apart, but then potentially coming back together.

Let's talk about some of those trade-offs. What's the benefit? We covered this. The organizational details table can now grow: if you have new charities joining every year or leaving every year, boom, that's fine, grow as much as you want. In the transactions table, you have expense codes, or any kind of code really, it might not even be an expense, and you can look up against that ever-changing list of expense codes you see there. You have much more flexibility, and you don't need to grow horizontally and change that fundamental schema, which you absolutely want to avoid at all costs.

What's the downside? The drawback comes when you want to do a consolidated query, like: give me the lobbying expenses for New York Association, Inc. Now, in order to tie those insights back together, as you've learned how to do in SQL, you use joins and unions to merge the data. Joins are now a necessity as part of the relational database model, and for performance reasons, joining data across multiple different tables and then doing that lookup is a big performance hit. But again, if you're relating database tables together across different and disparate data sources, you're going to have to use joins. There's no other way out. For BigQuery, we're going to cover a particular answer to this relational database problem. We now have data pulled from three different sources. Not ideal when it comes to performance, but for a squeaky-clean database architecture design, this is what we want in theory. What could we do?
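Here's what that consolidated lobbying query might look like against the three normalized tables. The table and column names (organizational_details, transactions, expense_codes, ein, and so on) are assumptions for illustration, not the exact schema from the course:

```sql
-- One business question now requires joining all three tables.
SELECT
  org.company_name,
  codes.expense_name,
  SUM(txn.amount) AS total_amount
FROM organizational_details AS org
JOIN transactions AS txn
  ON txn.ein = org.ein
JOIN expense_codes AS codes
  ON codes.expense_code = txn.expense_code
WHERE org.company_name = 'New York Association, Inc.'
  AND codes.expense_name = 'lobbying'
GROUP BY org.company_name, codes.expense_name;
```

Two joins for a single question is exactly the performance cost being described here.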
We could actually say: hey, you mentioned that having a separate lookup table that just maps an arbitrary code to the expense type is a good idea, but I don't want to do that. I don't want to have a third table. I'm going to delete that third table, and instead I'm just going to have the expense type spelled out for what it is right in the transactions table. What you've done here is traded in a little bit of redundancy, letting duplicative values creep in, so you get repeating expense types like lobbying, lobbying, lobbying. That redundancy is potentially bad: what happens if you want to change lobbying to lobbying services again? You have to change that for every row. But it's in favor of a performance boon, because you're no longer joining across three tables; you're joining across just those two.

Here's another issue: growing a relational database vertically at scale does have its limits. Traditionally, if you're trying to scan across millions or billions of these charities, doing a sequential scan is very hard and computationally expensive once you have things on the order of billions. Having one massive table, think of it like a spreadsheet where you have a massive vertical scrollbar because you're trying to capture all that information in one single table, does have its benefit: it means all facts are in one place. But when you're talking at the billion-row level, it's not exactly efficient to begin sorting and scanning through these massive vertical tables.

Again, this isn't BigQuery; in the relational database realm, if you're a SQL guru, you've probably heard of these before: indexes. These presorted indexes, which are like a separate artifact, were created for common queries. Say you wanted to get the ranked order, alphabetically, of all these companies, and you can imagine this at scale, say there are 10 billion records here. That's a very, very common query. An index optimizes read performance because you can read from it much, much faster, since it's presorted. Because we're getting a lot of folks who want to do this normal ranking of company names, we create this index so we can reference it and have that be much faster. You're gaining read performance, read meaning "select company name from this table," at the expense of write performance. If you have a changing list of company names, you now have to update that in two different places. If New York Association, Inc. was no more, you'd have to delete it from the organizational details table, and your index would have to be updated as well. So there's a trade-off there.

For BigQuery, both the concept of scaling out into multiple different tables and the idea of building these indexes to handle massive amounts of rows are handled in a fundamentally, completely different way. I'm going to talk about that next.
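As a quick sketch of that read-versus-write trade-off in a traditional relational database, here it is in PostgreSQL-style syntax, since BigQuery itself doesn't use these indexes. The table and column names are the same assumed ones as before:

```sql
-- Presorted index to speed up the common "rank companies alphabetically" read.
CREATE INDEX idx_company_name
  ON organizational_details (company_name);

-- The read this index accelerates: the database can walk the presorted
-- index instead of sorting billions of rows on the fly.
SELECT company_name
FROM organizational_details
ORDER BY company_name;

-- The write-side cost: every insert, update, or delete must now also
-- maintain the index behind the scenes, so data lives in two places.
DELETE FROM organizational_details
WHERE company_name = 'New York Association, Inc.';
```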