In this lesson, we'll describe some best practices for conducting tests beyond the full F-test in the context of one-way ANOVA. In particular, we'll introduce a distinction between planned comparisons and post hoc comparisons, and we'll also describe the problem of multiple comparisons, which will help us differentiate between these two types. So, in the context of one-way ANOVA so far, we've seen that the full F-test can help us answer the question: are there differences with respect to the mean of a continuous response variable across different groups, or across different levels of a factor? If we fail to reject the null hypothesis of the full F-test, then we probably don't need to go any further, because we haven't come up with any evidence that there are differences across any of the means. But if we do reject the null hypothesis, then there's more to do. Rejecting the null hypothesis of the full F-test tells us that there is some difference across these groups with respect to the mean of the continuous response, but it doesn't tell us what that difference is. It could be that just one of the groups is different from all of the others, which are themselves the same; it could be that every group is different from every other group. To understand the nature of the differences, we have to go further and think about other tests. There are two different ways of going about this: one is a planned comparison, and the other is a post hoc comparison. So, let's start with planned comparisons. Suppose that before we collect and analyze our data, we have reason to believe that certain means might be different. In such cases, we can specify these hypotheses in advance.
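As a concrete illustration of the full F-test described above, here is a minimal sketch of computing the one-way ANOVA F-statistic from between-group and within-group sums of squares. The group sizes and means are made-up numbers for illustration, not the course's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three hypothetical groups (levels of a factor), each with 20
# observations of a continuous response; group 2 has a shifted mean.
groups = [rng.normal(50, 5, 20), rng.normal(55, 5, 20), rng.normal(50, 5, 20)]

k = len(groups)                              # number of groups
n = sum(len(g) for g in groups)              # total sample size
grand_mean = np.mean(np.concatenate(groups))

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (SS_between / (k - 1)) / (SS_within / (n - k))
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f_stat:.2f}")
```

A large F relative to the F(k-1, n-k) distribution lets us reject the null that all group means are equal, but, as the lesson stresses, it does not tell us which means differ.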
So, for example, suppose that in our coffee brewing experiment, we had reason to believe that the hyper espresso method would produce espresso with a mean foam index greater than the average of the means of the other two groups. We could write this as a statistical hypothesis, namely H0: mu2 = (mu1 + mu3)/2, with the alternative Ha: mu2 > (mu1 + mu3)/2. The reason the alternative is one-sided is that it matches the way we specified the research hypothesis, namely that the hyper espresso method would produce espresso with a mean greater than the average of the other two means. To test these hypotheses, we could choose a significance level alpha and set up a contrast associated with the null hypothesis. Now, contrast is a technical term, and one that we'll define in just a bit; we could use that contrast to develop a test statistic for the null hypothesis. Further details about the application of planned comparisons we'll deal with in a future lesson. And what about post hoc comparisons? Well, in some cases, particular hypotheses about the relationships between means aren't specified before looking at the data, and in such cases we would conduct post hoc comparisons. Post hoc really means "after the fact," and it refers to the idea that we're looking at relationships after we've seen the data. Now, it might seem trivial to distinguish between tests that were articulated before looking at the data and ones that were articulated after, but this distinction isn't trivial; in fact, statisticians and data scientists have written extensively on its importance. The importance really lies in the fact that, after the data have been seen, researchers are more likely to zero in on and single out relationships that appear in the sample but are not true relationships in the population.
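To preview the idea of a contrast before the formal definition arrives, here is a rough sketch, on synthetic data, of testing H0: mu2 = (mu1 + mu3)/2 using the contrast coefficients (-1/2, 1, -1/2). The formulas are the standard ones for a contrast t-statistic with a pooled variance estimate; the exact procedure this course develops may differ in detail:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: group 2 (the "hyper espresso" analogue) has a larger mean
groups = [rng.normal(50, 5, 20), rng.normal(56, 5, 20), rng.normal(50, 5, 20)]

c = np.array([-0.5, 1.0, -0.5])   # contrast coefficients; they sum to zero
means = np.array([g.mean() for g in groups])
ns = np.array([len(g) for g in groups])

# Pooled within-group variance (the MSE from the one-way ANOVA)
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (ns.sum() - len(groups))

estimate = c @ means                       # estimated value of the contrast
se = np.sqrt(mse * np.sum(c ** 2 / ns))    # standard error of the contrast
t_stat = estimate / se                     # compare to t with N - k df
print(f"contrast estimate = {estimate:.2f}, t = {t_stat:.2f}")
```

A large positive t-statistic, compared to the upper tail of the t distribution with N - k degrees of freedom, supports the one-sided alternative Ha: mu2 > (mu1 + mu3)/2.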
Another way of saying this is that chasing trends seen in the data artificially inflates the Type I error rate. So post hoc comparisons are not really ideal; we should try to pre-specify and plan our hypotheses. But if we can't do that, then we need to adjust our testing methods for this inflation of the Type I error rate, and in future lessons we'll look at the details of how. It's important to emphasize that post hoc comparison without some adjustment of error rates is really a statistical fallacy called data dredging. More broadly, data dredging can be understood as any data analysis that does not take multiple comparisons into account and adjust the Type I error rate accordingly. There are lots of examples of actual research committing this fallacy, and one really interesting, and kind of funny, one comes from a 1988 study of heart attack patients. The study, and the story around it, was detailed in a New York Times article called "A Failure to Heal" by Siddhartha Mukherjee. The study included about 17,000 patients who had had a heart attack; some of them were given aspirin to take, and some of them were given a placebo. The study was meant to see if there were some health benefits to the aspirin, and in fact it did find some. The journal that accepted the study for publication, The Lancet, asked the lead author, Richard Peto, to go back and further analyze the data: the journal wanted some information on which patients benefited the most. Was it younger subjects or older subjects? Was it men or women? Which subgroup received the most benefit from the aspirin?
Peto knew statistics well enough to know that he couldn't conduct such an analysis in good faith. Analyzing the data for these relationships after the data had been collected and analyzed once would constitute data dredging, and it could lead to some unjustified conclusions. But The Lancet persisted, and Peto did just what any good statistician should do, which is to troll them. He analyzed the data for the meaningful, though probably false positive, relationships they had asked for, but he also subdivided the patients into different groups according to their astrological birth signs. As Mukherjee recounts, when the tongue-in-cheek zodiac subgroups were analyzed, Geminis and Libras were found to have no benefit from aspirin, but the drug halved the risk if you were born a Capricorn, so there was an effect for Capricorns. Peto then insisted that the astrological subgroups should also be included in the paper, in part to serve as a moral lesson for editors: they should not have authors go back and re-analyze data in a way that would produce lots of false positives. The important takeaway here is that if we perform enough tests after having observed the data, we're almost guaranteed to draw false conclusions, such as the one drawn in this study based on zodiac signs. Zodiac signs are meaningless with respect to aspirin and heart health; we know they shouldn't have an effect, and of course the apparent effects were accidents of slicing the data into these categories. It's also important to note that the fallacy of data dredging can have ethical implications.
So, for example, a study that incorrectly concludes that a certain treatment is important for a certain subgroup might erroneously recommend that individuals in that subgroup take that treatment or intervention, for example, taking a medication when they shouldn't actually be doing that. That false conclusion could lead to certain groups receiving harms that they shouldn't receive. There's also the converse issue of erroneously not pursuing a medical intervention that might in fact be helpful. So there are different types of errors here, and if those errors fall disproportionately on certain subgroups, that is ethically problematic. The correct use of statistics can thus be seen as an ethical imperative: we should try to do our analyses in ways that lead to the right conclusions, because the wrong conclusions can disproportionately impact groups that have been historically marginalized. It's worth thinking a little bit about why exactly we would have an inflated Type I error rate when we do multiple tests. To think about this, let's imagine that we're conducting a study in which we test 12 independent hypotheses, each at alpha = 0.05, so a 5% Type I error rate for each individual test. Taken together, though, we should analyze the family-wise error rate: the probability that at least one of these tests shows up as a false positive. A little bit of probability theory gives the answer, which I have here: the family-wise error rate is 1 - (1 - 0.05)^12, which is approximately 46%. That means that with 12 independent true null hypotheses, at least one will be a false positive about 46% of the time, so close to half of the time you would have a false positive. Now, that's much higher, obviously, than the 5% you thought you were working with when you controlled the Type I error rate at 5% for each individual test.
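The family-wise calculation above, along with the simplest possible adjustment (a Bonferroni correction, shown here only as a preview; the future lessons will treat adjustments more carefully), can be sketched as:

```python
# Family-wise error rate (FWER) for m independent tests, each at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m
m, alpha = 12, 0.05
fwer = 1 - (1 - alpha) ** m
print(f"unadjusted FWER = {fwer:.3f}")     # approximately 0.46

# A Bonferroni correction runs each test at alpha / m, which keeps the
# family-wise error rate at or below alpha.
alpha_adj = alpha / m
fwer_adj = 1 - (1 - alpha_adj) ** m
print(f"Bonferroni FWER = {fwer_adj:.3f}")  # just under 0.05
```

The unadjusted number, about 0.46, is the "close to half of the time" figure from the lesson; the adjusted number shows one way to bring the family-wise rate back down near the nominal 5%.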
Of course the error rate goes up when we consider the chance of at least one false positive, and that's really the lesson we should take from this: we should think about the family-wise error rate, the error rate of all the tests taken together. So in the next lesson, we'll focus our efforts on understanding how to conduct planned comparisons, and after that, we'll move on to the proper implementation of post hoc comparisons.