Welcome back. In this session we're going to talk about the data infrastructure that companies need to have in place before they can embark on large-scale AI-driven business transformation. To help us understand what kinds of data infrastructure companies need, we have with us Chris Child, who is a Director of Product Management at Snowflake. Chris, welcome, and please tell us a little bit about your background in the space.

Thanks, Kartik, excited to be here. As you mentioned, I work at Snowflake Computing, which is a cloud data warehouse company. I've spent my career working in data, both as an investor and now as an operator, helping build systems that help companies make better decisions and really run their businesses better.

Chris, I think a good place for us to begin is to talk about what exactly a database is. How are companies thinking about all the different kinds of databases, and what has that evolution looked like over the last several years?

Sure. There are really two types of databases that most companies end up needing. The first is what we call a transactional database. This is a system that keeps track of the important information that's running your business on a day-to-day basis. If we take a bank, for example, a bank would have a transactional database that keeps track of the balances for all the customers, and you'd use that every time someone starts a transaction to figure out if there's enough money in their account, to debit or credit their account, and to keep the running balance up to date. These are very useful, and they need to be very fast, and they tend to be very expensive. On the other hand, you end up with what's called an analytic database or an analytic system to process much larger sets of data over a long period of time. Continuing with the bank example, I might want to keep a history of every transaction and every balance that each of my customers has ever had.
It would be very expensive to keep this in my transactional database, so I move that data to a separate database, an analytic database, where I can keep these massive long histories. Then I can ask questions like, "I'd like a list of all of my customers whose balance grew by at least 10 percent in four of the last five years." My transactional database won't be able to answer that, but my analytic database will.

The transition to analytic databases clearly would involve investments in infrastructure. What kind of infrastructure are we talking about?

When you originally set up a data warehouse or an analytic database, you needed to buy special hardware. You needed to buy very expensive special software from a variety of different vendors. Again, we're talking about 20 or 30 years ago, which is when this methodology of storing your data came into play. Then the amount of data that people were collecting started to grow, and it came from lots of different sources, whether that be mobile apps, websites, marketing campaigns, data you're collecting physically in your store about what's happening, or your entire supply chain. Those specialized analytic databases that ran on special hardware became very expensive to operate and started not really being able to keep up with the performance needs of that massive amount of data. At this point we went through the first big evolution. Instead of going into these custom-built, specialized analytic data warehouses, the massive amounts of data started getting stored in a new system called Hadoop. Hadoop is an open-source system modeled on techniques that Google published for processing the massive amounts of web data that they collect and track. It was also designed to run on a giant network of very inexpensive hardware. Instead of these specialized, very expensive servers, you could run on hundreds of very cheap servers.
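As a rough illustration of the idea of spreading work across many inexpensive machines, here is a minimal map-and-reduce style sketch in Python. It is not the Hadoop API: the threads stand in for separate servers, and the `page_views` data and the helper names (`split_into_chunks`, `count_chunk`, `merge_counts`) are hypothetical, chosen only for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page-view log. In a real cluster each chunk would live on a
# different inexpensive machine rather than in one Python list.
page_views = ["/home", "/pricing", "/home", "/docs",
              "/home", "/pricing", "/docs", "/home"]


def split_into_chunks(records, num_chunks):
    """Deal records out round-robin so every worker gets a share."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def count_chunk(chunk):
    """Map step: each worker counts only the records in its own chunk."""
    return Counter(chunk)


def merge_counts(partial_counts):
    """Reduce step: combine the per-worker counts into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


chunks = split_into_chunks(page_views, num_chunks=3)

# Threads play the role of the cheap servers (an assumption for this sketch).
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))

total_counts = merge_counts(partial_counts)
print(total_counts)
```

No single worker ever sees the whole dataset, so adding more workers lets the same approach handle more data.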
The result was a much more cost-effective way to manage and process these massive amounts of data.

Now, isn't that really what creating a data lake is all about? Also, aren't we in the process of seeing many companies transition to newer big data tools or technologies like Spark and others?

Absolutely. The data lake is what people use to refer to basically massive sets of hard drives that they're storing all of this data in. It's a place you can pour huge amounts of data, like a lake, and then you can use tools like Hadoop or now Spark, which is a much more modern alternative to the Hadoop computation engine, to pull data out of that lake, run some calculations and transformations on it, and then put it back so that you can find it later. Traditionally, what we saw is that once people finished all those calculations, they wanted to be able to query that data very quickly. They would end up putting it into those data warehouses that they were using originally. Now they started to refer to those as data marts, which were where a small set of your customers or of your internal users could go get a subset of the data. But in order to get a new dataset loaded, you had to go back and write Hadoop or Spark jobs and get that data transformed and loaded into those data marts or data warehouses.

For those of you who are not familiar, Hadoop and Spark are technologies for storing and processing large amounts of data. Essentially they involve distributed storage and distributed processing of the data, which creates a lot of parallelization that helps the data processing happen faster. Now, coming back, Chris, we've also now seen a transition towards cloud data warehousing. Can you explain what exactly a data warehouse is? How does it fit in within this whole conversation of companies moving their data to data lakes and data marts?

Absolutely. A lot of people found that with these data lakes and data marts, it was still hard to keep track of all of your data.
It was in different places, there were massive amounts of it, and it was in inconsistent formats. Accessing it often involved having your engineering team actually write code that could run on these large parallelized systems. About 10 years ago, a lot of research started happening into what are now called cloud data warehouses. These are systems from Amazon or Google or from Snowflake, which are a re-imagining of the traditional data warehouse. They're designed to run on massively parallel sets of inexpensive hardware, like Hadoop. Generally, they run on hardware that you rent from cloud providers like Amazon or Google or Microsoft, instead of having to manage those servers yourself. But from the outside, they look and operate like a traditional data warehouse and have the same performance. What that means is you speak to them in a language called SQL, which is what traditional data warehouses and databases use. This means you can natively use Tableau or Looker or other analytics and BI tools right on top of them. Because they use that standard language, they also integrate well with large sets of tools. As we were talking a little bit about before, what you really want from your data platform overall is, first, somewhere to store all this data. You then need a set of ingest tools: how do you get the data into your data platform? Because the warehouse is SQL-based, you can use any of a wide variety of tools that are built specifically for that. You then need a set of transformations. These take the raw data that's coming in and turn it into something useful. As I'm sure you've talked about in this class, one of those techniques is machine learning, which you can use to take this raw data and score it and make predictions and figure out what's going to happen. But there are also simple things. I might be getting data about the set of actions that users are taking on a daily basis, but really I want to look at that on a monthly basis. One transformation would be rolling that up to a monthly basis.
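As a rough illustration of the monthly roll-up just described, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a SQL-based warehouse. The `daily_actions` table, its columns, and the sample rows are hypothetical and chosen only for this example; a real cloud data warehouse would use its own date functions in place of SQLite's `strftime`.

```python
import sqlite3

# In-memory database standing in for the warehouse (an assumption for this
# sketch). Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_actions ("
    "user_id TEXT, action_date TEXT, action_count INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_actions VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-03", 4),
        ("alice", "2024-01-17", 6),
        ("alice", "2024-02-02", 1),
        ("bob", "2024-01-09", 2),
        ("bob", "2024-02-11", 5),
        ("bob", "2024-02-25", 3),
    ],
)

# The transformation: collapse the daily rows into one row per user per month.
monthly_rollup = conn.execute(
    """
    SELECT user_id,
           strftime('%Y-%m', action_date) AS month,
           SUM(action_count) AS total_actions
    FROM daily_actions
    GROUP BY user_id, strftime('%Y-%m', action_date)
    ORDER BY user_id, month
    """
).fetchall()

for row in monthly_rollup:
    print(row)

conn.close()
```

Because the transformation is plain SQL, the same kind of query could be scheduled by whatever SQL-based tooling sits on top of the warehouse.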
The final piece you need is a query and visualization engine, as we mentioned: Tableau or Looker or other tools like that. It's a way to actually run queries and for your analyst team to build dashboards and basically ask questions of the data once it's been transformed. One of the big challenges that people had with Hadoop, or even with the Spark-based ecosystem, is that those tools often need to be custom-built for that ecosystem. Whereas if you use a cloud data warehouse, you get high performance, you get the scalability of Hadoop, but you also get access to the standard ecosystem of tools.

Chris, when we started our conversation, I mentioned that before companies can start using machine learning or other predictive technologies, they need to have a data infrastructure in place. Now, putting this data infrastructure in place obviously costs some money and cannot be taken lightly. What questions should a manager ask before they embark on such an exercise?

Absolutely. One of the mistakes that I've seen people make repeatedly is to think that having this data infrastructure is, in and of itself, an important thing to do. They'll set this up and they'll load a bunch of data and they'll buy a bunch of tools, and then they won't actually get any value out of it, because what they didn't do was think ahead of time about what problems they wanted to solve with data. What I would suggest to anyone who's going to undertake this journey is to think carefully first about the types of questions that you wish you could ask but can't because you don't have all of the data. Then think about the types of questions that you are answering today but that are taking a long time. An example of that is anything where you ask someone on your team to go spend two weeks collecting data and running analysis in Excel. Those are decent candidates for the types of problems that you could solve in minutes if you have the correct data infrastructure in place.
Finally, think about what data you need in order to answer those questions. It's generally not that useful to go collect every single piece of data you can possibly think of. Instead, ask what pieces of data are important to your business and are going to help you answer those critical business questions so that you can run your business better and more efficiently. At the end of the day, that's really the whole goal.

Chris, that has been a very helpful set of tips and overview of the data infrastructure companies need to think through. Thank you so much for joining us.

Thank you, Kartik. I appreciate it. Thanks for having me.