A different kind of scaling problem arises when we try to answer queries over a large number of data sources. But before we get to that, let's see how a query is answered in the virtual data integration setting. We are going to use a toy scenario in a medical setting, kept simple so that we have only four data sources, each with one table for the sake of simplicity. Notice that two sources, S1 and S3, have the same schema. This is entirely possible because sources may be independent of each other, and there is no guarantee that they would have exactly the same content; maybe these two sources represent clinics at different locations. Next, we look at the target schema. For simplicity, let's assume it is not an algorithmically created probabilistic mediated schema but just a manually designed schema with five tables. While we assume that the target schema is fixed, we want the possibility of adding more sources, that is, more clinics, as the system grows.

Now let's add the schema mappings. There are several techniques for specifying schema mappings. One of them is called Local-as-View, which means we write the relations in each source as a view over the target schema. This way of writing the mapping, as we can see here, may seem odd to you, but you don't need to know the syntax in detail (a small SQL sketch of this mapping appears below). Just as an example, look at the first mapping: the Treats relation in S1 maps to, and that is what the arrow means, the query "select doctor, chronic_disease from TreatsPatient, HasChronicDisease where TreatsPatient.patient = HasChronicDisease.patient". We see this query here in the yellow box. The only thing to notice is that the select clause of the query has two attributes, doctor and chronic_disease, which are exactly the attributes of the Treats relation in S1.

Now let's ask a query against the target schema: which doctors are responsible for discharging patients? This translates to the SQL query shown here. The problem is how to translate this query into a query that can be sent to the sources. Ideally this should be the simplest query, with no extra operations, as shown here; S3.Treats means the Treats relation in source S3. You can see the ideal answer. It turns out that finding such an optimal query reformulation is a very complex process, and it becomes worse as the number of sources increases. Thus query reformulation becomes a significant scalability problem in a big data integration scenario.

Let's look at the second use case. Public health is a significant component of our healthcare system. Public health systems monitor, detect, and take action when epidemics strike. Not so long ago we witnessed public health concerns due to anthrax, swine flu, and bird flu. These epidemics led a group called WADDS to develop a system for disease surveillance. This system would connect all local hospitals in the Washington, DC area and is designed to exchange disease information. For example, if a hospital lab has identified a new strain of a virus, other hospitals and the Centers for Disease Control (CDC) in the network should be able to know about it. It should be clear that this needs a data integration solution, where the data sources would be the labs, and the data would be lab tests, medical records, and even genetic profiles of the virus and of the subjects who might be infected. The table here shows the different components of this architecture; we will look only at the parts necessary for our purposes.
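Before going deeper into this second use case, here is the promised sketch of the Local-as-View mapping from the first use case. It is a minimal, illustrative rendering in SQL: the table and column names follow the lecture's description, but the exact mapping language shown on the slide is assumed.

```sql
-- Local-as-View (LAV): each source relation is defined as a view over the
-- target (mediated) schema.  Names follow the lecture's description.

-- Source relation S1.Treats(doctor, chronic_disease), written over the target schema:
CREATE VIEW S1_Treats AS
SELECT tp.doctor, hcd.chronic_disease
FROM   TreatsPatient tp,
       HasChronicDisease hcd
WHERE  tp.patient = hcd.patient;

-- A user query is posed only against the target schema, e.g.
-- "Which doctors are responsible for discharging patients?"
-- Query reformulation must rewrite it in terms of the source relations,
-- ideally as a minimal query over a single source (the lecture's ideal
-- answer uses the Treats relation of source S3).
```

The point of the sketch is that the reformulation step has to search over all such view definitions to find a correct and minimal rewriting, which is why the process gets harder as the number of sources grows.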
Just know that RIM, which stands for Reference Information Model, is a global schema that this industry has developed and expects to use as a standard. Why do we want to exchange and combine information from different hospitals? Every hospital is independent and can implement its own information system any way it sees fit. Therefore, even when there are standards like HL7 that specify what kind of data a healthcare system should store and exchange, there are considerable variations in the implementation of the standard itself. For example, the two boxes here show a difference in the representation of the same kind of data; this should remind you of the data variety problem.

Let's say we have a patient with ID 19590520 whose lab reports containing her plasma protein measurements are required for analyzing her health condition. The problem is that the patient went to three different clinics and four different labs, which all implement the standards differently. On top of that, each clinic uses its own electronic medical record system, so we have a very large amount of heterogeneous data. The data integration system's job is to transform the data from the source schemas to the schema of the receiving system, in this case the RIM schema. This is sometimes called the data exchange problem. Informally, a data exchange problem can be defined like this: suppose we have a given source database whose relations are known; we also know the target database's schema and the constraints that schema must satisfy; further, we know the desired schema mappings between the source and this target schema. What we do not know is how to populate the tuples of the target database from the tuples of the source relations in such a way that both the schema mappings and the target constraints are simultaneously satisfied.

In many domains like healthcare, a significant amount of effort has been spent by the industry in standardizing schemas and values. For example, LOINC is a standard for medical lab observations. Here, items like systolic blood pressure or a gene mutation are encoded in a specific way given by the standard. So if we want to record that the systolic/diastolic pressure of an individual is 132 over 90, we will not write out the string "systolic blood pressure" but use the code for it. The ability to use standard codes is not unique to healthcare data; the 50 states of the US all have two-letter abbreviations. Generalizing, therefore, whenever we have data whose domain is finite and a standard set of codes is available, we get a new opportunity for handling big data, namely reducing the data size through compression. Compression refers to creating an encoded representation of data such that the encoded form is smaller than the original representation.

A common encoding method is called dictionary encoding. Consider a database with 10 million records of patient visits to a lab, where each record indicates a test and its result. We show it here in a columnar structure to make the point that the data is kept in a column-store relational database rather than a row-store relational database. Now consider the column for the test code, where the type of test is codified according to the standard. We replace the string representation of the standard code by a number, and the mapping between the original test code and the encoded number is stored separately. Suppose there are a total of 500 distinct tests. Then this separate table, called the dictionary, has 500 rows, which is clearly much smaller than ten million, right?
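To make dictionary encoding concrete, here is a minimal SQL sketch under assumed table and column names: the large fact table stores only a small integer per test, and the 500-row dictionary maps that integer back to the standard test code.

```sql
-- Dictionary: one row per distinct test code (about 500 rows in this scenario).
CREATE TABLE test_dictionary (
    code_id   SMALLINT PRIMARY KEY,   -- small integer stored in the big table
    test_code VARCHAR(20) NOT NULL    -- original standard (e.g. LOINC) code string
);

-- Fact table: ~10 million lab-visit records keep only the integer code.
CREATE TABLE lab_visits (
    patient_id INT,
    visit_date DATE,
    code_id    SMALLINT REFERENCES test_dictionary(code_id),
    result     VARCHAR(50)
);

-- Decoding happens with a join at query time.
SELECT d.test_code, v.result
FROM   lab_visits v
JOIN   test_dictionary d ON d.code_id = v.code_id
WHERE  v.patient_id = 19590520;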
Now, 500 distinct values can be represented by encoding them in 9 bits, because 2 to the power of 9 is 512. Other encoding techniques would be applied to attributes like the date and the patient ID. Of course, for data this large we cannot reduce the total number of actual rows; we still have to store all ten million rows. But we can reduce the amount of space required by storing the data in a column-oriented data store and by using compression. Indeed, modern systems use query processing algorithms that operate directly on compressed data. Data compression is an important technology for big data.

And just as LOINC is a set of codified terms for lab tests, clinical data also uses SNOMED, which stands for Systematized Nomenclature of Medicine. SNOMED is a little more than just a vocabulary. It does have a vocabulary, of course: a collection of medical terms used in human and veterinary medicine that provides codes, terms, synonyms, and definitions covering anatomy, diseases, findings, procedures, microorganisms, substances, and so on. But it also has relationships. As you can see, a renal cyst is related to kidney because the kidney is the finding site of a renal cyst. If we query against an ontology, the query looks like a graph pattern. In this box, we are asking to find all patient findings with a benign tumor morphology. In terms of querying, we are looking for edges of the graph where one node is the concept we need to find, and it is connected to a node called benign neoplasm (that is, benign tumor) through an edge called associated morphology. Applying this query against the data here produces all benign tumors of specific organs, as you can see in the group circled in orange. Now that we have these terms, we can use them to search the patient records in which these terms would have been used.

So what is the essence of this use case? It shows that an integration system in the public health domain, and in many other domains, must be able to handle variety. In this case there is a global schema called RIM, shown here; all queries and analyses performed by a data analyst should be against this global schema. However, the actual data, which is generated by different medical facilities, needs to be transformed into data in this schema. This requires not only format conversions but also respecting all constraints imposed by the source and by the target. For example, a source may not distinguish between an emergency surgical procedure and a regular surgical procedure, but the target may want to put them in different tables. We also saw that the integration system for this use case would need to use codified data, which gives us the opportunity to apply data compression and gain storage and query efficiency. In terms of variety, we saw how relational data like patient records, XML data like HL7 events, and graph data like ontologies are used together. To support this, the integration system must be able to do both model transformation and query transformation. Query transformation is the process of taking a query on the target schema and converting it to a query against a different data model; for example, part of an SQL query against the RIM may need to go to SNOMED, and hence would need to be converted to a graph query in the SNOMED system. Model transformation is the process of taking data represented in one model in a source system and converting it to equivalent data in another model in the target system.
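To give a feel for the graph-style query described above, here is a minimal sketch that assumes the SNOMED relationships are stored as a simple edge table of (subject, predicate, object) triples; the table name and the value strings are illustrative, not the actual SNOMED representation.

```sql
-- Hypothetical edge table holding ontology relationships.
CREATE TABLE snomed_edges (
    subject   VARCHAR(100),   -- e.g. 'Renal cyst', 'Benign tumor of kidney'
    predicate VARCHAR(100),   -- e.g. 'finding site', 'associated morphology'
    object    VARCHAR(100)    -- e.g. 'Kidney', 'Benign neoplasm'
);

-- "Find all findings whose associated morphology is benign neoplasm (benign tumor)."
SELECT subject AS finding
FROM   snomed_edges
WHERE  predicate = 'associated morphology'
  AND  object    = 'Benign neoplasm';
```

In a full system, a fragment like this would be the output of query transformation: the relevant piece of an SQL query against RIM gets rewritten into whatever graph query language the SNOMED store actually uses.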