So welcome to module one of Framework for Data Collection and Analysis. In this first module, we'll walk you through the entire data collection and analysis process. I will point out not only the different data sources that can be used, but also point you to the various courses in our specialization that shed more light on each of these particular steps. We'll then continue with a framework that allows you to evaluate each data source separately. In these Coursera courses, we have many different people. There are researchers, statisticians, and methodologists, people who know about survey data, for example, that joined this course. But there are even more domain experts who have a particular research question in mind that they are trying to answer, and who are looking for guidance on how to do that. Over in the questionnaire design course, we've seen this: people from all over the world with very, very different interests, some in health, some in labor, some in transportation. It can be anything, and of course a lot of market researchers. Now, the researchers that help you get the right design and the right analysis have to work closely with the domain experts. And the larger your data sets get, and the more complicated your IT infrastructure is, the more important it is that you also work with computer scientists and system administration people. So we hope that this course is one of many you will take that help you learn each other's language a little bit, be better able to communicate your needs, or even do everything you need yourself. But most likely it will be a joint project between people with these various skills. So let's look at this process of data collection and analysis. It all rests on the basis of a research question. Without a good research question, you will not be able to do good research. I've seen countless times that people just had a lot of data and weren't sure what to look for.
In the next module, I'll come back to that in particular, because it's quite striking what happens when you don't have clear ideas of what you're looking for or what you expect to see. Next to the research question, there's the data generating process. I'll spend a lot of time talking about the data generating process, because it's crucial that you understand it. In fact, I would make the claim that you can't do any inference without knowing that process properly. Then there's a lot of data cleaning, data curation, and data storage necessary. Here, IT infrastructure and system administration people can help you quite a bit. This class is not about programming. I'll show you a little snippet of what is important for this, but by and large, other courses, for example in the data science specialization over at Hopkins, have a whole course on these kinds of things. Our focus here is much more conceptual: planning a good study, knowing which data sources to take or which already exist, what to collect, and how to design the collection. And then, of course, there's the data analysis piece. Some of you might have had introductory statistics courses a long time back. Maybe you didn't like them, or maybe you loved them, but they were really theoretical, not as applied as you might need here, or they used very different data. Now, in this specialization, we have one segment that deals with data analysis, in particular for survey data, because there you often encounter issues of sampling, issues of weighting, and complex hierarchical structures that make a different type of analysis necessary. That course will cover this, but I'll come back to it, and you will see information on this particular piece on the course platform. And then finally, there's data output and access. In this course here, we'll talk a little bit about the ethical issues involved.
And we'll come back to that later in some of the other courses. Now, I've already spoken a lot about this process, so why talk about each step even more? Well, let's see. If we think of research questions, we can separate them easily into three different categories. The first one, on the left, is a descriptive category. We're trying to show means or percentages for certain subgroups, here, for example, three groups. We don't have an x-axis, so we don't know what these groups are. We don't have a y-axis, the vertical one, so we don't know what the values are. It's just an image capturing that this is a descriptive statistic. You're trying to put out some numbers about the world, about a piece of the world, describing certain variables or features as they appear in the world. Then there's a second type, which I would call causal; these research questions deal with causality. This one here captures all things that have to do with wanting to know whether a certain treatment, for example taking some pain medication, helps against headache and makes you smile. We could teach a whole class on this issue, because the question is, can we really measure causation? With humans, it's often hard, if not impossible, to see the counterfactual. What would have happened had I not taken that pill? Almost impossible to do. Well, actually, it is impossible to do. There's always something that varies. However, there's of course a large enterprise, not only in the medical field but also increasingly in economics and public health, where that is the research question, and there are experiments in which data are collected for this. Finally, the third type of research question is prediction. What you see here is a graph from the Energy Information Administration. They put out beautiful graphs; I recommend a visit to their website. And you see the prediction of the oil price in the US.
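To make the descriptive category a bit more concrete, here is a minimal sketch in Python. The groups and values are purely hypothetical, invented for illustration; the point is just that a descriptive question boils down to computing numbers like these for subgroups.

```python
# Hypothetical records: (subgroup, interested_in_coursera as 0/1).
records = [
    ("health", 1), ("health", 0), ("health", 1),
    ("labor", 0), ("labor", 1),
    ("transport", 0), ("transport", 0), ("transport", 1),
]

def percent_by_group(rows):
    """Return the percentage of 1s within each subgroup."""
    totals = {}
    for group, value in rows:
        count, n = totals.get(group, (0, 0))
        totals[group] = (count + value, n + 1)
    return {g: 100.0 * count / n for g, (count, n) in totals.items()}

print({g: round(p, 1) for g, p in percent_by_group(records).items()})
# {'health': 66.7, 'labor': 50.0, 'transport': 33.3}
```

Each percentage is exactly the kind of number a descriptive bar chart, like the one on the slide, summarizes for its groups.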
Now, what I would argue here is that no matter whether you deal with description, causation, or prediction, you most likely have some kind of inferential goal in mind. So what does that mean? There is a larger population, a longer time frame, or a larger geographic area that you want your results to hold for. So when I describe a population, let's say I want to give a description of percentages for subgroups in the US, then of course I have to define what I mean by the US population. And more likely than not, I won't be able to get that kind of data for everybody in the US, in part because people are born and die all the time. And so I can only capture one particular segment at a point in time, even if I were able to do a census. And just look around on the census webpages and you will see how hard it is to do a census to begin with. So more often than not, people take surveys, or better yet, I should say sample surveys: having a smaller set of data from a larger population, in which case you want to do inference to the larger population. The same is probably true with causation and prediction. All of them rest on some form of data. More likely than not, these data are a sample, either in geographic space, or in time, or because they are a subset of the population. Even if you had access to all of the Twitter data, you couldn't possibly run your analysis with all the data, because they're constantly coming in. Now, computing and software get faster and faster, so you can update the analysis. However, there is always some limit to the processing power, and it might be necessary to look at the results once in a while. So in some form or another, whether it is time, space, or the number of people, you will end up doing some sort of inference. Now, the requirements on the subset of data, on this sampling or selection of which cases to look at, which units to look at, are a little different depending on whether we deal with description, causation, or prediction.
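As a small illustration of that inferential step, here is a sketch with synthetic data: we draw a simple random sample from a made-up "population" of blood pressure values and attach a standard error to the sample mean, so the estimate comes with a statement about the larger population rather than just about the sample. All numbers here are invented for illustration.

```python
import random
import statistics

random.seed(42)
# A synthetic "population" of 100,000 blood pressure values.
population = [random.gauss(120, 15) for _ in range(100_000)]

n = 500
sample = random.sample(population, n)  # simple random sample without replacement

mean_hat = statistics.fmean(sample)
# Standard error of the mean under simple random sampling
# (ignoring the finite population correction, which is tiny here).
se = statistics.stdev(sample) / n ** 0.5

print(f"estimate: {mean_hat:.1f}, 95% CI: "
      f"({mean_hat - 1.96 * se:.1f}, {mean_hat + 1.96 * se:.1f})")
```

The interval is the inferential claim: from 500 observed units, we say something, with quantified uncertainty, about all 100,000.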
For description, I would argue that what you need is a positive and known selection probability. So in order to really be able to say 25% of the population are interested in Coursera, I would need to know, A, what is the population? But also, I would want to make sure that everybody from that population, however I defined it, had a positive selection probability. If I systematically exclude, by way of how I collect the data, certain parts of the population from reporting their interest, or reporting their blood pressure if I'm interested in the median blood pressure of people living in the United States, then I run the risk that I have biased results, that my results are not a correct inference for that entire population. I should also know the selection probability, because if I only take a part, and in particular if I want to estimate totals, I would need to weight the data up so that they match the values in the population. So for this particular segment, positive and known selection probabilities: being able to know, for everybody, that they had a chance to appear in the data, and knowing what that probability is, is important. What we do with that knowledge, we'll learn in the sampling class and in the course on missing data, because obviously, in practice, that's never going to hold perfectly, right? There are always some cases missing for some reason or another, and there are techniques for how to overcome that problem. Now, for the other two, causation and prediction, the known selection probabilities are a little less important, maybe not important at all, actually. What I do want to know, though, is that I still have a positive selection probability for everybody. Only then would I feel comfortable making that kind of inference. Now, let's step away from people for a second here. You could think of this differently if we think of Twitter feeds and sentiments expressed on Twitter.
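The idea of weighting the data up with known selection probabilities can be sketched in a few lines: each sampled value is divided by its positive, known probability of selection, so that units with a small chance of being picked stand in for many population units. This is the classic Horvitz-Thompson estimator of a total; the values and probabilities below are hypothetical.

```python
# Hypothetical sample: three units with unequal, known selection probabilities.
sample = [
    {"value": 10.0, "p_select": 0.10},  # stands in for ~10 population units
    {"value": 4.0,  "p_select": 0.50},  # stands in for ~2 population units
    {"value": 7.0,  "p_select": 0.25},  # stands in for ~4 population units
]

# Horvitz-Thompson estimate of the population total:
# weight each value by the inverse of its selection probability.
total_hat = sum(unit["value"] / unit["p_select"] for unit in sample)
print(total_hat)  # 100 + 8 + 28 = 136.0
```

Notice that the estimator only works because every probability is positive and known; a unit with zero selection probability can never appear, and an unknown probability leaves you without a weight.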
If I want to make an inference to all possible sentiments, then you have to ask: does the database, the body, the corpus of data that I am using, have positive selection probabilities for all possible sentiments? In that case, I could make the inference, even if the people and the sentiments are not a one-to-one match. And that can be true for jobs, or diseases, or anything of that sort. So it is important that you know what units you are looking at, and what the units are in the population that you want to make an inference to. And then, when you talk about the data, you should talk about these aspects, because otherwise people might get misled by the data that you provide to them. So much, very broadly, on the research question with its three categories. We'll continue with the data generating process and some basic ideas on what to think about there.