In this video, we'll do some regression modeling in R, and we'll use a data set that deals with the advertising budgets of some companies. And we'll try to look at the relationship between those advertising budgets and their sales. So the data set consists of 200 different companies, and each of those companies has an advertising budget allocated to YouTube, Facebook, and a local newspaper. And each of these advertising budgets is measured in thousands of dollars. Which means that down here, for the YouTube, Facebook, and newspaper variables, if we wanted to know how much money is actually in each of these budgets, we would have to multiply them by $1,000. Now the sales variable, which will act as our response in the regression that we do below, is measured in thousands of units sold. So here we would again multiply by 1,000, and that would tell us how many units were sold for this company with these budgets here. But it doesn't tell us how much money each unit is worth. And so it doesn't tell us the total dollar amount of sales and how that compares to the total amount spent on these advertising budgets. Now the first thing we'll take a look at is just an exploration of the data. And what we really want to do is see if the data set that we're given has anything odd in it, like missing values. Those could be coded as NA in R, or they could be coded as something different. Like a bunch of nines is sometimes a code for a missing value of a variable. It's also possible that zero is used as a stand-in for a missing value, and that can be dangerous, right? If you don't pick up on that, you would be skewing your analysis, depending on how many zeros there are. It could be pretty bad, and same thing with several nines in a row. If these are treated as numerical values and not just as codes for missing values, that could be problematic.
So we should try to find those things out. So the first chunk of code that I have here is to try to do that. First, I'm just looking at the dimensions of the data frame. So it has 200 rows, so 200 units in the sample, and four columns, those four variables that we've mentioned. In the second line, what I'm doing is trying to find any value stored as NA. So is.na() applied to this marketing data frame will produce a matrix of logicals, TRUE or FALSE, depending on whether there's an NA in that place in the data frame or not. So if we go back up, here is just the beginning of the data frame. There are no NAs here, but suppose there was an NA here instead of 55.08. Then we would have a FALSE for all of these other values, except a TRUE here. And then what I'm having R do is sum up all of those values, and R is smart enough to know that when you're summing, it should treat a TRUE as a one and a FALSE as a zero. So that means if we took the sum of all of these entries and there were no TRUEs, so no NAs, we would get zero. And that's exactly what we get down here. So we know that there are no NA values. And then the next line is giving us a summary. And for me, the summary is important because I want to look through and see: are there any values like a bunch of nines? That would be a red flag; maybe that's a code for a missing value. Or is there a variable that shouldn't be able to take on zero that actually is zero? Of these variables here, the only one that takes on zero is the Facebook budget variable. But it's plausible that a company doesn't spend any money on Facebook advertising. So to me, that minimum value of zero doesn't raise a red flag. If this were something like measuring the weight of people and there was a zero there, to me that would suggest we have a missing value, because people can't weigh 0 pounds. Now, all of the max values look pretty plausible to me. No string of nines, for example.
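The checks described above can be sketched in a few lines of R. The real marketing data isn't loaded here, so this uses a tiny made-up data frame (with one NA planted deliberately) just to show what each call returns:

```r
# Small simulated stand-in for the marketing data frame, with one NA planted
marketing <- data.frame(
  youtube   = c(276.12, 53.40, 20.64),
  facebook  = c(45.36, 47.16, NA),
  newspaper = c(83.04, 54.12, 48.96),
  sales     = c(26.52, 12.48, 11.16)
)

dim(marketing)         # rows and columns: here 3 rows, 4 variables
is.na(marketing)       # logical matrix: TRUE wherever a value is NA
sum(is.na(marketing))  # TRUE counts as 1, FALSE as 0, so this counts the NAs
summary(marketing)     # scan the min/max for suspicious codes (zeros, strings of nines)
```

On the full data set, `sum(is.na(marketing))` returning zero is what tells us there are no values stored as NA.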
And so I think we can be reasonably confident that the values were coded in a way that leaves no missing values here. Now we will start to do some exploratory data analysis, and our exploratory data analysis will have both univariate and bivariate explorations. And we're going to do the EDA on the entire sample that we have, so the entire data set. In an earlier video, I sort of warned you against doing that because of this phenomenon called double dipping. The idea being that if you explore the data and you find some relationships in the exploration, and then you use the same data to try to come up with a model and explain those relationships, there's an increased chance of error. And in particular, an increased chance of a Type I error: basically, finding relationships that aren't really there. The problem here, though, is that we have a pretty small data set. It's only 200 rows. And so there's not really enough data to split it into an exploratory set, a training set (where we would fit the model), and a testing set (where we might try to validate the model). And so what we'll do in this notebook is explore the entire data set, and then fit a model on that same data set. But we should be cautious about the conclusions that we draw here, because we have a relatively small sample, and we're doing some exploration and some model fitting on the same set of data. So maybe we can think about this entire analysis as exploratory. And if we really wanted to draw some more rigorous business conclusions based on this data, we should try to collect more data. All right, so let's look at some univariate explorations. So first we'll look at some histograms of each of the variables, and I've loaded some packages that may be helpful here.
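For reference, if we did have enough data, the three-way split mentioned above could be sketched like this. The proportions, the seed, and the stand-in data frame are all arbitrary choices for illustration, not anything from the video:

```r
set.seed(1)

# Stand-in data frame with 200 rows, like the marketing data set
marketing <- data.frame(sales = rnorm(200))

n <- nrow(marketing)
# Assign each row one of three roles, roughly a third each
idx <- sample(rep(c("explore", "train", "test"), length.out = n))

explore_set <- marketing[idx == "explore", , drop = FALSE]
train_set   <- marketing[idx == "train",   , drop = FALSE]
test_set    <- marketing[idx == "test",    , drop = FALSE]

nrow(explore_set) + nrow(train_set) + nrow(test_set)  # every row used exactly once
```

With only 200 rows, each split would hold roughly 67 observations, which is why the video keeps everything together and treats the whole analysis as exploratory.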
The tidyverse is the package that I'm using for some of this code, namely for gathering all of the variables that are numeric; in this case, they're all numeric, but this code could be used more generally. It would drop any categorical variables if you had them. And then we're using ggplot to give us histograms. The ggplot histograms are just a bit more aesthetically pleasing, and they use the grammar of graphics, which is a nice way of coding up plots. So really what I'm doing here is a histogram of each one of the variables. So we'll take the first two first, namely the Facebook budget and the newspaper budget. Now, neither of these looks normally distributed. So that's one thing to notice. And you'll also notice that the YouTube budget down here does not look normally distributed either. But that's not really a problem. If you remember, our regression assumptions don't require that the predictors have normal distributions. So the fact that these don't look bell-shaped is not that big of a deal. Now, what may be an issue is that the newspaper variable potentially has some outliers, right? There are some values in the top of the distribution that seem like they were in some way different from the majority of the values. So we'll keep that in mind. We'll also notice that the histogram for the response doesn't look that normal either. It's got something of a bell-shaped curve, but there are a lot of points in the center, more so than you would expect for a normal distribution. But you'll remember, and I summarize this down here: the response should be normal, but the assumption of normality in regression is one of the ones lower on the list. Basically, our model will be pretty robust to deviations from normality. And in fact, finding the least squares solution doesn't require normality. But when we want to do inference, we will make an assumption of normality.
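The gather-then-facet approach described above can be sketched as follows. This uses tidyr's `pivot_longer()` (the newer replacement for `gather()`) and a simulated stand-in data frame, since the real marketing data isn't loaded here; bin counts and plot styling are arbitrary choices:

```r
library(tidyr)
library(ggplot2)

# Simulated stand-in for the marketing data, so the sketch runs on its own
set.seed(1)
marketing <- data.frame(
  youtube   = runif(200, 0, 350),
  facebook  = runif(200, 0, 60),
  newspaper = runif(200, 0, 120),
  sales     = runif(200, 2, 32)
)

# Stack every numeric column into (variable, value) pairs;
# non-numeric columns, if any, would simply be excluded
long <- marketing |>
  pivot_longer(where(is.numeric), names_to = "variable", values_to = "value")

# One histogram panel per variable, each with its own x-axis scale
ggplot(long, aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")
```

Selecting with `where(is.numeric)` is what makes the code reusable on data sets that mix numeric and categorical columns.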
But deviations from that assumption are not too terrible. So I'm not terribly alarmed at the fact that we don't have something that looks all that normal. At least not yet. So let's zero in on this issue of the histogram of newspaper showing potentially some outliers. And what we'll do here is look at a box plot to see if there are any outliers based on what we could call the IQR criterion; IQR stands for the interquartile range. And it defines outliers in the following way. It says: if you have a data point that's above the third quartile plus 1.5 times the interquartile range, or below the first quartile minus 1.5 times the interquartile range, then that data point is an outlier. And here are the definitions of those quantities, just in case you've forgotten. Quartiles split the data up into four parts. So here, 25% of the data would be below the first quartile, and 75% below the third quartile. And the IQR is defined as the difference between the third and the first quartiles. So this gives you one definition of what an outlier might be. It doesn't tell you why that outlier arose, or whether it should be taken out of the data set or left in. But it's one way of flagging data points. So here are those box plots, and we notice that for the newspaper variable, there are two points that are flagged. So R will draw these two points as individual points rather than including them in the box plot, flagging them as outliers. And you'll notice that none of the other variables seem to have outliers, at least as defined here by this interquartile range criterion. And then below, I actually print out those outliers for newspaper. You can do that with this code here. If you left off the square brackets, you would get all of them printed out; I'm just separating them. So I can see that the outliers for newspaper are both of these values.
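The IQR rule above can be computed directly. This sketch uses a simulated newspaper-like vector with two large values planted at the top, not the real data; note that `boxplot.stats()` uses hinges, which can differ very slightly from `quantile()` quartiles at the margins:

```r
set.seed(1)
# Simulated stand-in with two deliberately extreme values at the end
newspaper <- c(rnorm(198, mean = 36, sd = 15), 120, 136)

q1  <- quantile(newspaper, 0.25)  # first quartile: 25% of data below
q3  <- quantile(newspaper, 0.75)  # third quartile: 75% of data below
iqr <- q3 - q1                    # same as IQR(newspaper)

# Flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
flagged <- newspaper[newspaper < q1 - 1.5 * iqr |
                     newspaper > q3 + 1.5 * iqr]
flagged

# R's boxplots apply the same 1.5 * IQR rule; the flagged points
# are returned in the $out component
boxplot.stats(newspaper)$out
```

Indexing the flagged vector, e.g. `flagged[1]` and `flagged[2]`, prints the values one at a time, which is the square-bracket separation mentioned above.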
All right, so now that we've looked at some univariate plots and summaries, let's go ahead and look at some bivariate summaries. So the first thing we'll look at is correlations between variables, and we can look at this in different ways. One is to use the corrplot package in R and use the corrplot function. What that gives us is a matrix of these correlations, along with some visuals related to the strength of each correlation. So here is the code for that up here; I won't say too much about it. It should be easily adaptable to other data sets. And the second line is just giving you a different set of colors than the default. I think the default might be sort of standard reds and blues; I chose some bolder colors. So here is that plot of correlations, and notice the way you read this. The response, sales, is down here in the last row. And so we would read, for example, this block as being the correlation between sales and the YouTube variable. And that correlation looks pretty high: 0.78 is pretty high, and we've got an ellipse here showing that there's a positive linear relationship. As opposed to something like the correlation between sales and newspaper; well, this is a bit more circular, with a lower correlation coefficient. So: high correlation, low correlation. And the correlation between sales and the Facebook advertising budget, well, that's pretty high too; that looks rather nice. And so we see, at least pairwise, that the response has correlations with some of the predictors, and some correlation with newspaper advertising, but it's not super high. Now, these other boxes here: we could look at correlations between the predictor variables, which at some point will become important for us. But right now we'll just notice that the correlation between YouTube advertising and Facebook advertising is quite small. So they're relatively uncorrelated; same thing with YouTube and newspaper.
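A sketch of the corrplot approach described above might look like this. It assumes the corrplot package is installed, simulates a stand-in data set (with correlations wired in roughly where the video finds them), and swaps in a non-default palette; the exact arguments the video used are not shown, so these are illustrative choices:

```r
library(corrplot)

# Simulated stand-in: sales depends on youtube and facebook,
# newspaper is mildly correlated with facebook
set.seed(1)
n <- 200
youtube   <- runif(n, 0, 350)
facebook  <- runif(n, 0, 60)
newspaper <- 0.3 * facebook + runif(n, 0, 100)
sales     <- 5 + 0.05 * youtube + 0.2 * facebook + rnorm(n, sd = 2)
marketing <- data.frame(youtube, facebook, newspaper, sales)

M <- cor(marketing)              # matrix of pairwise Pearson correlations

corrplot(M, method = "ellipse",  # ellipse shape encodes sign and strength
         addCoef.col = "black",  # overlay the numeric coefficients
         col = colorRampPalette(c("darkorange", "white", "steelblue"))(200))
```

Narrow, tilted ellipses correspond to strong linear relationships; nearly circular shapes correspond to correlations near zero, which matches how the plot is read in the video.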
And then the correlation between newspaper and Facebook, well, that's a bit higher, but at this point, nothing to be excited or alarmed about. Two other quick notes about this plot. Of course, down the diagonal, when we're thinking about the correlation between, for example, newspaper and newspaper, that correlation is going to be one. So the diagonal is not very interesting. And then we really only need one side of the diagonal, the lower triangle or the upper triangle, because notice they're mirror images of each other. So this box up here is the same as this box down here. So knowing about correlations is nice, but it doesn't tell the full story. Correlations are really measures of the strength of a linear relationship. And if you don't actually have a linear relationship between your variables, then the correlation will be misleading. So it's actually nice to look at scatter plots of the different variables. That way you can get a sense, at least in the sample: are there any relationships at all? And if there are, are they linear or nonlinear? Those will be important questions. So the pairs function can help us get pairwise scatter plots of all of the variables in our data frame. And if we look here at the pairwise plots, one thing that we would want to look at is the response, sales: what's the relationship between it and the predictors? So again, if we look along this bottom row, we can see the relationship between sales and, for example, YouTube here. What this tells us is that as the YouTube marketing budget increases, we do have an increase in sales. But it appears to be nonlinear, right? It seems to increase a lot for low YouTube marketing budgets. And then once you hit a certain value, well, maybe a little bit lower than 50,000, so maybe even 25 or 30,000, you start to taper off. So the trend seems to be nonlinear, in that it's sharp at first and then less sharp, but still increasing.
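The pairwise scatter plots come from base R's `pairs()` function. Here is a self-contained sketch on simulated data, with a square-root response baked in purely so the sales-versus-youtube panel shows the kind of sharp-then-tapering increase described above:

```r
set.seed(1)
n <- 200
# Simulated stand-in for the marketing data frame
marketing <- data.frame(
  youtube   = runif(n, 0, 350),
  facebook  = runif(n, 0, 60),
  newspaper = runif(n, 0, 120)
)
# Toy nonlinear response: rises quickly at first, then tapers off
marketing$sales <- 5 + 4 * sqrt(marketing$youtube) / 3 + rnorm(n)

# One scatter plot for every pair of columns; the sales row/column
# shows the response against each predictor
pairs(marketing)
```

As with the correlation matrix, the panels above and below the diagonal are mirror images, so in practice you only need to read one triangle.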
Another important thing to notice here is that for low values of the YouTube variable, there's low variability. And then as you get higher, the variability seems to get larger and larger, right? There's a lot of variability out here for large YouTube marketing budgets, but small variability down here. So if we look at sales against the Facebook budget, we'll see that the relationship here seems to be somewhat linear. Although it's not clear, again, because there's somewhat low variability for low values of the Facebook variable, and then much higher variability for high values of the Facebook budget. So we could have a linear trend here. But in fact, we could have something nonlinear that increases rapidly and then decays off, and maybe even takes a downward trend. So based on this amount of variability, it's not entirely clear what the relationship is. And then for sales versus newspaper, well, there doesn't seem to be much of an interesting relationship here at all. It looks a bit like random scatter. So in my summary, I say we'll look at the apparent linear relationship between sales and Facebook, but note that it's not entirely clear, with the amount of variability that there is there. It could actually be that the underlying true relationship is nonlinear.