[MUSIC] Hello everyone, today we're going to talk about how you might want to analyze your read count data after mapping your short reads to a database, or several. First we're going to talk a bit about some ecologically meaningful ways to describe your data, your different samples; we're going to talk about diversity and richness. Then we're going to talk about how you might want to transform your original raw counts, the read counts that map to the different organisms in your sample. Then we're going to talk a bit about how you can do differential abundance analysis to figure out whether certain genes or certain organisms are overrepresented in certain samples. Then we're going to talk about some issues with presence-absence detection, and finally some ordination analysis.

Very simply, if you have a metagenomic sample and you have done your read counting, you will have counts like these for the different organisms you see here. If you count the organisms over here, that should correspond to the numbers we find here. Richness is simply the number of different organisms you find in your sample. This can be a useful way to describe the sample: how many different entities do we find in here? But it can also be hard to compare between samples, because with sequencing technology you can get very different sequencing depths, so some samples are sampled more deeply than others.

Another useful way is to describe the alpha diversity of the sample. There are many different ways to do this: the Shannon, Simpson, and Fisher indices, for example. But we're just going to briefly describe the Shannon-Wiener index, which is a very popular way of describing diversity. Diversity is basically a measurement of both the richness and the evenness, that is, the composition of the different entities making up your sample. For example, do we have a few organisms dominating the sample and a lot of rare organisms? Or are all the organisms more or less evenly distributed in the sample? Both affect the diversity measurement, so it's a way to get a single number that describes the composition of your sample.

Here we see the counts from before. With 10 counts in total, the relative proportions are 40%, 10%, 30%, and 20%. We then take the log of each proportion and form the product of each relative abundance and its log. Then we simply add these up and change the sign, giving H' = -sum(p_i * ln(p_i)). And voila, you have a single number to describe your sample. Maybe this doesn't make much sense for one sample, but when you have many samples, now you have something you can actually compare.

A popular way to visualize metagenomic data is in heat maps like this, where you have your samples on one of the axes, here the rows: sample one, two, three. And you have your different species as the columns. These are the raw counts for the three samples, and you can see that sample one is basically dominating with very high counts. If you add these up, you will see that the samples are very different in sequencing depth; this sequencing run simply gave us more mapped reads. A way to get around this is to divide each sample by its total count, so you get relative proportions like we saw earlier with the diversity. And you can see how this affects the clustering: now samples one and two cluster together instead of two and three, because they look more similar.
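To make this concrete, here is a minimal R sketch of these calculations, using the vegan library, which also comes up later in this lecture. The counts are the ones from the worked example (4, 1, 3, 2); the species names are made up for illustration.

```r
# A minimal sketch of richness, relative proportions, and Shannon diversity
# in R. The counts match the worked example above; species names are made up.
library(vegan)  # install.packages("vegan") if needed

counts <- c(spA = 4, spB = 1, spC = 3, spD = 2)

# Richness: the number of different organisms observed in the sample
richness <- sum(counts > 0)        # 4

# Relative proportions: divide each count by the sample total
p <- counts / sum(counts)          # 0.4, 0.1, 0.3, 0.2

# Shannon-Wiener index: H' = -sum(p_i * ln(p_i))
H <- -sum(p * log(p))              # about 1.28

# vegan computes the same quantities directly on a samples-by-species matrix
mat <- rbind(sample1 = counts)
specnumber(mat)                    # richness per sample
diversity(mat, index = "shannon")  # Shannon diversity per sample
decostand(mat, method = "total")   # relative proportions per sample
```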
Uniquely for sequencing, and not so relevant for general ecology where you count organisms directly, we are actually probing the genomic content of a sample. So you might also want to normalize to the gene lengths or genome sizes. Because maybe species one just has high counts in general because it has a larger genome, not because there were more copies of the genome or because the organism was more abundant in that sample. So we might want to divide both by the sample sums and by the gene lengths or genome sizes. And then there are entirely different ways to standardize your matrix to make it comparable; there are many ecological transformations or standardizations. Here we have done something called the Wisconsin double standardization, where each species is first scaled by its maximum and each sample is then scaled by its total.

If you have some different groups of samples, you have quantified all the different genes or organisms in them, and now you want to know whether this organism or that gene is higher in this group than in that group, you might want to perform a differential abundance analysis. You could think, okay, let's just run a t-test or something like that for each and every organism we have. But there are actually smarter ways to go about this, because this kind of data has a lot of properties that cause issues for normal statistical methods.

For example, we can have very uneven sequencing depths across our samples. Here we see a histogram for an actual sequencing project where most samples have around 50 million reads, but some have less than ten million, very low, and a single sample has more than 100 million reads. This is something we need to account for. Also, if you look at a specific gene, here an antimicrobial resistance gene, across your samples, you might see that in a lot of samples it is not observed at all; it's zeros. And then your counts are distributed like this, in a negative binomial distribution. So you want your statistics to be able to handle this kind of data, which a normal t-test would not be very good at. Then you have multiple testing: you have maybe thousands of genes, so the p-value you get for each gene doesn't necessarily mean that much on its own; you need to account for that and adjust for multiple testing. And finally, the variance is not constant; it varies with abundance, so you can have a larger or smaller spread depending on how abundant your genes or organisms are. And then again, maybe you would like to account for the different sizes of the genomes. Here you can see some different kinds of organisms, and they have very different genome sizes; this is also something you might want to account for.

So in summary, use a method that is actually meant for this zero-inflated, negative binomial data. There are several R libraries you can use, for example DESeq2, metagenomeSeq, edgeR, and baySeq. Some of DESeq2's features, for example: it can account for the different sequencing depths. It can do outlier detection for each of the genes within your defined groups, flagging samples with wacky counts. It also borrows information across genes: because there is a strong association between a gene's mean abundance and how dispersed it is, you get more statistical power by analyzing the data this way rather than just running a thousand different t-tests. And it can adjust your p-values for multiple testing for you.
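As a rough illustration, here is a minimal DESeq2 sketch. The count matrix and the group labels are simulated stand-ins, and the object names ("count_mat", "farm", "clinic") are made up; with real data you would plug in your genes-by-samples read count matrix and your sample metadata.

```r
# A minimal sketch of a differential abundance test with DESeq2,
# on simulated negative binomial counts
library(DESeq2)  # BiocManager::install("DESeq2") if needed

set.seed(42)
count_mat <- matrix(rnbinom(80 * 6, mu = 50, size = 1), nrow = 80,
                    dimnames = list(paste0("gene", 1:80),
                                    paste0("sample", 1:6)))
coldata <- data.frame(condition = factor(rep(c("farm", "clinic"), each = 3)),
                      row.names = colnames(count_mat))

dds <- DESeqDataSetFromMatrix(countData = count_mat,
                              colData   = coldata,
                              design    = ~ condition)

# One call estimates size factors (sequencing depth), shrinks dispersion
# estimates by borrowing information across genes, runs the tests, and
# flags count outliers per gene
dds <- DESeq(dds)

# Log2 fold changes with multiple-testing-adjusted p-values (padj)
res <- results(dds)
head(res[order(res$padj), ])
```

One design note: because DESeq2 compares each gene against itself across samples, gene length largely cancels out of the between-group comparison; the length and genome-size normalization discussed above matters more when you compare different genes or organisms within a sample.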
So presence-absence: if you're doing surveillance and you just want to know whether this gene is present, or whether this bacterium, this pathogen, is present or not, that can be a bit hard with metagenomic data. Your read counts are going to depend very much on how deeply you sequenced your sample, so just saying this gene is not there, or this pathogen is not there, can be a very tough call to make. But what you can do is spike-in experiments, where you add some bacteria or genomic material to your sample. Then you go ahead and process and sequence your sample, do your read mapping, and if you know exactly how much you added, you can extrapolate: okay, our detection limit is like this. Then you can go out and say, okay, if we did not detect the organism, then if it is present at all, it is less abundant than our detection limit.

Ordination analysis is a way to summarize this multivariate data in an easier-to-understand way. If you have all your different organisms here as columns and all your samples as rows, these are all the counts, represented in the heat map. These are some of the steps you might want to do in an ordination analysis. First, maybe you would like to not use the raw counts, but transform them in some way, standardize them. Here you can see we again did the Wisconsin double standardization, and now you can suddenly see a lot more of what is going on; all these very rare organisms over here, we can now see what is happening with them.

Then, when you have your standardized abundance matrix, you can calculate the dissimilarities: how similar or dissimilar is each pair of samples? That is what you see here, with samples both as rows and as columns, with blue meaning they are not dissimilar, that is, they are very similar. Each time a sample is compared to itself, it is completely blue, zero dissimilarity, and you can see that some samples are more dissimilar. When you have this matrix, you can do an ordination like a principal coordinate analysis, and you get an image where the distances between your samples in the image approximate how dissimilar they are. This is a nice way to reduce this multidimensional data into something that is easier to understand. When you reduce maybe 1,000-dimensional data to only two dimensions, you might not be able to describe all the variation, but it can be a good starting point.

You can do a lot more with ordination; you can actually analyze these dissimilarities statistically. There are a lot of R functions in the vegan library, such as adonis2, betadisper, and anosim, where you can go in and test different things, as shown in the sketch below. For example, you might test whether the overall changes in the microbiome differ between groups. Maybe you have done surveillance in different years, or in different places, areas, or countries, and you want to test whether there is an overall difference among the pathogens, the bacteria in general, or just the antimicrobial resistance genes.
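To tie the steps together, here is a minimal sketch of this ordination workflow with vegan. The abundance matrix and the group labels are simulated for illustration; with real data, "abund" would be your samples-by-species count matrix and "group" would come from your metadata, for example year or country.

```r
# A minimal sketch of the ordination workflow: standardize, compute
# dissimilarities, ordinate, then test the groups statistically
library(vegan)

set.seed(1)
abund <- matrix(rnbinom(12 * 30, mu = 20, size = 1), nrow = 12,
                dimnames = list(paste0("sample", 1:12), paste0("sp", 1:30)))
group <- factor(rep(c("siteA", "siteB"), each = 6))

# Step 1: standardize the raw counts (Wisconsin double standardization:
# each species is scaled by its maximum, then each sample by its total)
std <- wisconsin(abund)

# Step 2: pairwise Bray-Curtis dissimilarities between samples
d <- vegdist(std, method = "bray")

# Step 3: principal coordinate analysis, reduced to two dimensions
pcoa <- cmdscale(d, k = 2)
plot(pcoa, col = group, pch = 19, xlab = "PCoA 1", ylab = "PCoA 2")

# Step 4: test whether overall composition differs between the groups
adonis2(d ~ group)  # PERMANOVA
anosim(d, group)    # a rank-based alternative
```

So yes, that's all for this lecture. [MUSIC]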