Welcome back. The question of why things happen is at the core of just about everything we do in analytics. Whether we're looking at what happened in the past, what will happen in the future, or what we should do about it, it's in our nature to seek to understand what's really going on. We want to know why. We want to know what caused something to happen, or what will cause something to happen. Not only does that knowledge help us today, but it helps grow our understanding of how things work and informs how we think about our business. However, the real world is complicated, and really understanding the causes of the events we see is much easier said than done. In this video, we're going to talk about the ideas of correlation and causation and address one of the most common errors that occurs in analytical work: mistaking one for the other.

So what is correlation? In the simplest terms, correlation is a mutual connection or relationship between two or more things. When we think about correlation in analytics, we're usually referring to how characteristics vary in relationship to each other. When one measure is higher or lower, does another measure vary in a consistent or predictable way? If so, we'd say the measures are correlated. The simplest way we normally see correlation is when we plot the values of one measure, or variable, versus another. Here's an example: a plot of average height versus average weight for women in the United States. As we might expect, as height increases, so does weight. In fact, the relationship between the two almost looks like a straight line, or what we call a linear relationship.

However, there are a number of regular and irregular patterns we might see when looking for relationships. Here are a few. Some of these are positive correlations, like the log, exponential, power, and logistic patterns; in these patterns, when one value goes up, the other value also goes up. Some are negative correlations, like the negative linear and negative exponential patterns; in these, when one value goes up, the other value goes down. And the others are more complicated: the quadratic, threshold step, and cyclical patterns all suggest a relationship between the two values.

When we're dealing with data in the real world, the relationships we see aren't this clean. There's usually some noise or variation in our measures, so our visualizations are a bit fuzzier. As we can see in this example diagram, the correlation we see can be stronger or weaker depending on how tightly related the values are. These are examples of variations on a linear relationship, but the same would be true for any of the other types of patterns we'd observe.

There are more specific measures of correlation that apply in statistics and mathematics. The most common is the Pearson correlation coefficient, which measures the degree to which there is a linear relationship between two variables. We usually see this value represented by the letter r. Now, our objective in this video is not to get too deep into the math itself, so we won't be presenting the equation we use to calculate r. But to make a long story short, if we have two sets of measures that are perfectly positively correlated, we have r = 1, like the first diagram. Conversely, if our measures are perfectly negatively correlated, we have r = -1, like the last diagram. And if there is no correlation at all, we have r = 0, like the middle diagram.
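To make r concrete, here's a minimal sketch in Python. The height and weight figures are made up for illustration, not the data behind the plot in the video; the three reference cases simply mirror the diagrams just described.

```python
# A quick look at Pearson's r using NumPy (illustrative numbers only).
import numpy as np

# Roughly linear height (inches) vs. weight (pounds) pairs, with some noise.
height = np.array([58, 60, 62, 64, 66, 68, 70, 72])
weight = np.array([115, 120, 131, 140, 142, 153, 164, 170])

r = np.corrcoef(height, weight)[0, 1]  # Pearson correlation coefficient
print(f"r = {r:.3f}")                  # close to 1: strong positive linear fit

# The three reference cases from the diagrams:
x = np.arange(10.0)
print(np.corrcoef(x, 2 * x + 3)[0, 1])   # exactly  1.0 (perfect positive)
print(np.corrcoef(x, -2 * x + 3)[0, 1])  # exactly -1.0 (perfect negative)

rng = np.random.default_rng(0)
print(np.corrcoef(x, rng.normal(size=10))[0, 1])  # near 0 (no correlation)
```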
Of course, it's generally the case that we see something in between, like the remaining cases. You might recall a similar measure that shows up in the results of a linear regression. There's usually something called the coefficient of determination, or r squared, which is just the square of r. Like r, it's a measure of the strength of an observed relationship between two sets of values. We bring it up here just so you make the connection to that type of analysis. It's not uncommon for someone to show up with a regression analysis that has a high r squared value and who wants to jump straight to causality.

Since causality is what we really wanted to get at anyway, let's shift gears a bit and talk a little about that. Causation means that one event or state is the result of the occurrence of another event or state. In other words, there is a cause-and-effect relationship between two or more ideas. When we see data that implies a relationship, a causal relationship is one option for what's really going on. But it's not the only one. Let's assume we have two ideas, A and B, and we've observed a correlation between them. How might A and B actually be related? Well, we might suggest that A causes B. Conversely, we could also suggest that B causes A. However, it could also be the case that there is a third factor, let's call it C, that actually causes both A and B, such that there really is no causal relationship between A and B. It's also possible that A does cause B, or vice versa, but the causation actually happens through C as an intermediate factor. Finally, it can also be the case that there's actually no relationship at all, and what we are seeing in the data is pure coincidence.

This is where we sometimes get ourselves into trouble. As data analysts, we're kind of hardwired to believe that there's an answer in the data somewhere. And that makes it really hard to accept that there might not actually be a relationship in something that looks clearly related to our eye. But it turns out that it's not too hard to find examples of two things that seem to correlate almost perfectly with each other, but which in fact are completely unrelated. If you haven't already visited Tyler Vigen's Spurious Correlations website, I suggest you pause this video and take a few minutes to scroll through some of the more entertaining examples you'll find there. For those of you who can't get there now, here are a few examples. Each one of these shows a high degree of correlation between two completely unrelated sets of values. So the point here is that we need to resist the urge to assume relationships exist when it's possible they don't.

The other mistake we tend to make is that we get too focused on the ideas we have in front of us and forget to consider the influence of other factors. The specific error I see more than any other is assuming a causal relationship between two characteristics or events when there's really a third factor causing them both, the relationship we saw earlier.
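Before we get to a business example, here's a small simulated sketch of that third-factor pattern. All the numbers are invented: we construct a hidden factor C that drives both A and B, while neither A nor B causes the other.

```python
# Invented simulation: a hidden common cause C makes A and B look correlated
# even though neither one causes the other.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

c = rng.normal(size=n)            # hidden third factor C
a = 0.8 * c + rng.normal(size=n)  # A is driven by C, not by B
b = 0.8 * c + rng.normal(size=n)  # B is driven by C, not by A

print(np.corrcoef(a, b)[0, 1])    # clearly positive (about 0.4)

# Holding C roughly fixed makes the apparent A-B relationship vanish:
band = np.abs(c) < 0.1            # a thin slice where C barely varies
print(np.corrcoef(a[band], b[band])[0, 1])  # near 0
```

The second print is the tell: once the common cause stops varying, the apparent relationship between A and B disappears with it.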
Let's illustrate this with an example. Let's say that we're a wireless carrier, and we're trying to assess the impact of people accessing their account online. Among other things, we look at the simple relationship between historical account access by an individual and the likelihood that the individual will cancel their service. What we basically find is that people who have accessed their account are half as likely to cancel. The business manager in charge of customer retention says, "This is great! All we need to do is incent people to use the web, and we can cut our cancel rate in half."

What's wrong with this interpretation? Well, it depends on whether or not you believe that the act of accessing the web is influencing someone's likelihood to cancel. It turns out that it's far more likely that a third factor, like age, comfort with technology, or just willingness to engage with the company, is driving both the likelihood to use the web and the likelihood to cancel. Simply incenting people to use the web isn't likely to change any of these underlying causal factors, and therefore isn't likely to have an impact on cancellation. This may seem like a really obvious example, but it actually happened. In fact, I've seen this exact scenario and rationale around web access and cancellation come up not just once, but at least three times in three different places during my time in the wireless industry. So it definitely happens. Even very smart people can make silly mistakes sometimes.

So how do we avoid mistaking correlation for causation? Is there a way we can prove that causation does in fact exist? More often than not, the answer is no. Proving causation is pretty hard. But what we can do is eliminate alternate explanations, either through context or by showing empirically that the other candidate relationships don't hold. In our web-and-cancellation example, we used reasoning based on our knowledge of the industry and likely customer behavior to question the assumption of causality. Again, context turns out to be critical in interpreting relationships in data. It should be the first line of defense in avoiding mistakes. Can we think of any other plausible explanations for what we see? Is there any other data that would contradict an assumed relationship? We can also apply simple ideas like temporal precedence. For a relationship to be causal, the causing factor needs to be present before we see the effect. If we see something that we think is an effect happen before its cause, we know that we have the wrong relationship.

It turns out that one of the best ways to isolate causation is to run a controlled experiment; there's a sketch of one at the end of this section. In a simple controlled experiment, I isolate two randomly selected groups of subjects and apply a treatment to one group while not applying that treatment to the other. I call these the treatment and control groups, respectively. I then observe differences between the two groups. If I observe a difference between the treatment and the control, and I've ensured that the only thing that differed between the groups is the treatment that was applied, then I have strong evidence that the treatment caused the difference. If we have the luxury of an experiment, that's great; but if we don't, we fall back on our context, logic, and alternate-explanation approaches for assessing causality.

So let's circle back to where we started. As analysts, we're constantly asking the question why, and finding causal relationships in data is a big part of answering that question. But it's important to recognize the pitfall we can fall into in mistaking correlation for causation. In this module, we'll continue to explore ways we can fail to interpret data correctly, and learn what we can do to avoid those mistakes.
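Finally, here's the experiment sketch promised above, built on the wireless scenario. Every rate and effect size is invented: we assume web use has no real causal effect on cancellation and that a hidden engagement factor drives both. The naive observational comparison reproduces the "half as likely to cancel" finding, while random assignment correctly shows no treatment effect.

```python
# Invented simulation of the wireless example: engagement drives both web use
# and cancellation, while web use itself has no causal effect on cancellation.
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Hidden factor: roughly half of customers are highly engaged.
engaged = rng.random(n) < 0.5
uses_web = np.where(engaged, rng.random(n) < 0.6, rng.random(n) < 0.1)
cancels = np.where(engaged, rng.random(n) < 0.05, rng.random(n) < 0.20)

# Naive observational comparison: web users look about half as likely to cancel.
print(cancels[uses_web].mean(), cancels[~uses_web].mean())    # ~0.07 vs ~0.15

# Controlled experiment: randomly incent half the customers to use the web.
# Random assignment is independent of engagement, so the groups differ only
# in the treatment -- which, by construction, has no effect on cancellation.
treatment = rng.random(n) < 0.5
print(cancels[treatment].mean(), cancels[~treatment].mean())  # ~equal, ~0.125
```

The punch line is the second comparison: once assignment is random, the apparent benefit of web access disappears, which is exactly what the context-based reasoning above predicted.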