[MUSIC] Hello, in this video, we'll talk about some data sources that will give you population-level data, and they do exist in the US. If you live outside the US, there might be other data sources for your populations that you can look at. So the key takeaways for this video are to describe the advantages and disadvantages of using all of these various data sources for health analytics. Explain the added value of some of these potential non-traditional data sources, going beyond insurance claims and EHRs. And also explain the features of a number of these available population-wide data sources. So there are a number of factors that affect population health data sources. For example, there has been increased end-user adoption of health IT solutions in the US. More specifically, nearly all of our providers have adopted EHRs. Most of our patients or consumers have adopted smart technology that can help us to collect more data on a population level. There's also an expanding data variety that's generated by all of these emerging health data sources. And how to reconcile data from EHRs, some personal health records, mobile health and so on is going to be a challenge. There is also enhanced continuity of data flow among these health IT solutions. And that is mainly because we're getting better interoperability standards, but we are still not there in terms of making it very fluid and perfect. So there's still a long way to go, but it's much better than a decade ago. And of course, there's increased coverage of health IT solutions in general, meaning that different localities, or almost all states in the US, have a wide adoption of different population-wide data repositories that could be used for different health analytic purposes. So here is one example, you can look at this diagram showing the EHR adoption among non-federal acute care hospitals in the US. You can see, going from left to right, going from 2008 to 2014, the adoption has gone up a lot.
Actually, as of today, almost 100% of our hospitals in the US have a certified EHR. Now EHR adoption has also gone up among office-based physicians. You can see, as of 2013, almost 80% of our office-based physicians, the outpatient settings, were also using a certified EHR. There are a lot of different data sources. If somebody really wants to see the entire picture about a population, they need to collect data from many sources, and that is tough. Here in this diagram, if you look at the center, that's where the patient and physician interact. But that's just a fraction of the time throughout the year; the rest of the time, the patient and the physician interact with their own circles or their own communities. You can see the patient interacting with their family and their community. And the physician interacting with their practice and their bigger network of an integrated delivery system, their accountable care organization and so on. And depending on where you stand and what sort of lens you have, you look at a different source of data. You can see EHR data here, the electronic health record; it's more to the left, that's more clinical. We have the health information exchange data, we have insurance claims. We have national datasets that, you know, the federal government sometimes collects. This is all on the left side, but there is also a lot of data growing on the right side. And these are personal health records, biometrics, telehealth, mHealth, social network, market data, consumer data and so on. And if somebody truly wants to do population health analysis, they would need all of this data in one place, which doesn't exist as of 2018. Now, you can also look at all of these data sources as a continuum of data sources, you can see there are these big circles going from left to right. We have the private or the business enterprise, we have the patient, the provider, the population or the payer, and also the public health.
And depending on where you are in this spectrum or the continuum of data sources, you might see consumer health data, EHR data, claims data, or surveillance data. So you have to be aware that when you use one data source, you have only looked at that part of the continuum of data sources or that interaction. And all of your analytics is only as good as your data, so you have to be very cautious about what data source you use for what analytics. So as I said, there are some data sources with wide population coverage here in the US, and here is a short list. You can see there are some consolidated insurance claims or large insurers that have data on hundreds of thousands or a dozen million patients. We also have centralized or distributed EHR research data warehouses where entities have started sharing their EHR data. That helps in order to look at different trends in the population as needed. There are different ways to collect data from smart devices, and that's mHealth. And then look at that data across the population. Or there might be some large surveys that the US government or the state governments run. And they create these registries of data that could be used. Now, one big problem is that often these different data sources do not have a crosswalk. You cannot find a patient in a CDC survey, like the BRFSS survey, for whom you also have their EHR data or their claims data. So that's one of the big problems we have right now, but within each of these data silos, there is a good possibility that you can find a population-level data source. Let's go through some of these, I'll try to go quickly, and you can always read the slides or the notes about this video later. So the first one is something called All-Payer Claims Databases or APCDs here in the US. Most states started something called the APCD initiative where all of the commercial payers, the insurance companies, put all of their claims in one place.
They report all of the claims to a non-profit organization in that state, which then collates it, puts it together. And that becomes a statewide claims database for all of their commercially covered population. Currently more than 30 states have, or are implementing, an APCD, and the APCDs basically contain insurance claims data, and you can see the list here. Diagnoses, procedures, NDCs of medications, information on services, prescribing physician, health plan payments and so on. Typical fields of insurance claims do exist in APCDs. Here is the map of states that either have an existing APCD, or are implementing one, or have a strong interest, or have no plans yet. So as you can see, a good majority of the states have it, and if they share the data with you, then you have commercial claims data for the entire population. Another source is the federal insurance programs here in the US, which CMS, the Centers for Medicare & Medicaid Services, runs; basically two insurance programs. Medicare, which is mainly for the elderly population, and Medicaid, which is mainly for our lower-income population and also disabled populations. Both of them are federally funded and Medicare is also federally managed, but Medicaid is managed by the state. So the money comes to the state and then they manage the rest. You can see, a source like Medicare covers almost 45 million patients, so it's a very good data source. And very much like other claims, we have the Part A, Part B, Part D, which refer to the hospital, the professional or outpatient, and also the medication. Medicare also covers more, you have hospice data, DMEs, or durable medical equipment, and home health agencies as well. And you can see that Medicaid also covers a good number of populations of interest, including the CHIP program, which is the Children's Health Insurance Program. Now, there are also commercial insurance claims. There are some enterprises out there that have collated a lot of commercial claims.
And they have put it all in one place for either operational or research use cases. I have listed some of them here, you can always look them up on the web and see what their data contains. Some of them are in a couple million patients, some of them are in tens of millions of patients. And they would be very good sources of data if you want to find trends in a population. Here is an example of an EHR population-level data source, and it's called the Corporate Data Warehouse. Some people call it the clinical data warehouse of the VA, the VHA, the Veterans Health Administration, which provides services to all of our retired military personnel here in the US. And they have more than 100 hospitals and a lot of clinics within those settings. And the entire network here in the US is on one EHR system. And that EHR system has one data warehouse, where all of the EHR data goes into one place. So that's why they can provide a nationwide population-level view on their patient population using that data warehouse. Their data warehouse has collected data since early 2000 and includes many data types found in EHRs, such as demographics, consults, health factors, immunization, mental health, and so on. Another dataset that has population-level data is called the Healthcare Cost and Utilization Project, HCUP. This is managed by a federal agency called the Agency for Healthcare Research and Quality, or AHRQ, and it's basically hospital discharges. So AHRQ goes to each state and asks them to report all the hospital discharges into one place. Throughout the slides, we didn't talk much about hospital discharges; they also use a very specific format or structure for that, so it's not perfect but it's well standardized. And then AHRQ puts it all together and they also create a national view of the HCUP data or the Healthcare Cost and Utilization data. That could very much be used to do research on hospital use and utilization.
So there are multiple databases in this HCUP initiative, you can see some of them here. There is the National Inpatient Sample; it's not the entire nation's hospital discharges, but it's a sample from every state that is statistically a good representation of the state, all in one database. There is one for children called the Kids' Inpatient Database, and there's also one on the EDs on the national level called the Nationwide Emergency Department Sample. So each of these can give a good national representation of hospital discharges. And there are certain data formats that are well documented, so you can learn how to interact with these databases. Another national-level database in the HCUP initiative is the Nationwide Readmissions Database, which tracks patients for readmission purposes across different hospitals, which is a very important health policy issue here in the US. There are also other ones like the state ones, you can go and basically try to get your hands on one of the state databases and try to do research on it. So there are the State Inpatient Databases and also the State Emergency Department Databases as well. Another initiative that has created a lot of good value for medical research is called the National Patient-Centered Clinical Research Network, or PCORnet for short. And this is mainly funded by PCORI, the Patient-Centered Outcomes Research Institute, which is an institute trying to promote patient-centric research, which brings the patient's perspective into all of this research. And what they did, they've started funding big academic medical research centers to connect EHR data and create these PCORnet data warehouses, where that EHR data gives you a good representation of what is happening in the population of that region. So the PCORnet initiative has 29 of these networks or coordinating centers. 11 of them are very clinically oriented, they are called the Clinical Data Research Networks, CDRNs. And again, as I've said, they're academic medical centers or integrated delivery systems.
Basically big health networks sharing data in one place to do research with that shared EHR data. And there are 18 Patient-Powered Research Networks, or PPRNs, that also have the patients' perspective in them; they are special use cases. As you can see the numbers are big, there are almost 155 organizations involved in this PCORnet initiative and more than 3000 collaborators. And individual providers are feeding EHR data into these PCORnets. As I said, the data sources are mainly the EHR data, but they also get data from other sources as I have listed here. Now another upcoming source of data is the mHealth data. So now that everybody has a smartphone and they're collecting data, there are initiatives to get that data in one central place for some analytics. One example of it is HealthKit on the iPhone platform, the iOS platform, where it collects data from an individual. And there are many apps that can collect even more data and they all go into this HealthKit that then reports to ResearchKit. And there are a lot of evolving things in this area where you can consent patients through their smartphones, and collect all of their data on how many steps they had or how many miles they walked, and things like that in one big database. That's one side of it; another side of the mHealth platforms is something called the SMART initiative, where they created platforms through which you can collect data in EHRs by certain apps. And that has also created a lot of different ways that you can connect external devices, and your data could then be fed into EHRs. There is a link that I've provided where you can find more information about that SMART Health IT platform as well. And of course, there are always these large-scale surveys, they're health-related surveys. And the most important ones here in the US are the surveys administered by CDC, the Centers for Disease Control and Prevention. [INAUDIBLE] the list here, the Behavioral Risk Factor Surveillance System, BRFSS.
Then we have the National Health Interview Survey, we have other ones, NHANES and NHCS and NVSS and so on. Most of these, on an academic level, are freely available, you can download them from CDC's website and start using them. But on an individual level, there are certain restrictions on how to use that data. As I said earlier in this video, there is always this issue that we do not have any other data attached to these datasets. Like we don't have the EHR data, claims data or anything else, so what you get is what's in that dataset, and that's it. And your analytics need to use only that data to answer the questions of interest. There are also other data sources with a wide population coverage, and some of them are not even health data resources. But they could be used for health outcome research, such as some social and administrative sources collected by the federal or state departments. There are tons of environmental data that are geo-related, I've listed some of them here. Marketing consumer datasets and even financial datasets that might be used for population health analytics. So in summary, we talked about the factors that affected how these data sources for population-level analytics have grown over the last decade or two. And also we reviewed some data sources, including the all-payer claims databases, and Medicare and Medicaid from CMS. We talked about some large commercial insurance claims databases that are available out there, and you might use them. We talked about some large health providers that might have a clinical data warehouse that could be big enough for analytics on a population level. And the most important is the VHA, of course, with the Corporate Data Warehouse. And at the end, we also talked about HCUP and other data sources, thank you.