To describe the distribution of a quantitative variable, you also need precise numerical descriptions of its center and spread.
>> The mode is a kind of average. There are three kinds of average, and each one tells us something different, so we need to make sure we understand what each one means.
>> When we use the term average, we usually mean one of three things: the mean, the mode, or the median. The difference between them is easy to understand, especially if you've played darts before. I threw two lots of three darts, and in my six darts I scored a 2, three 3s, a 12, and a 13. Now let's see if we can work out the mean, the median, and the mode. First, the mean: we take the total of all six scores and divide by the number of observations. If we want the modal score, we simply look for the most common score, the value observed most often. If we want the median score, we write the scores down in ascending order and look for the middle value. There's a slight problem here: we have an even number of observations, so we take the two middle values and work out their mean. So for my not very good dart playing, the scores were 2, 3, 3, 3, 12, 13. The mean is 2+3+3+3+12+13 divided by 6, that is 36 over 6, which equals 6. The mode is 3. The median, since we have an even number of observations, is 3 plus 3, the middle two observations, divided by 2, which equals 3. Notice that if the dart player had scored, say, 19 instead of 13, the mean would increase to 7, but the mode and the median would be unchanged.
>> So let's briefly review numerical measures of center. Intuitively speaking, a numerical measure of center tells us what a typical value of a variable's distribution is. The three main numerical measures of the center of a distribution are the mode, the median, and the mean.
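As a quick check of the darts arithmetic, here is a minimal Python sketch using the standard-library `statistics` module. The variable names are illustrative and not from the lecture, and Python is used here only for convenience; the course itself works in SAS.

```python
import statistics

# The six dart scores from the example, already in ascending order.
scores = [2, 3, 3, 3, 12, 13]

mean_score = statistics.mean(scores)      # total of the scores / number of scores
median_score = statistics.median(scores)  # mean of the two middle values when n is even
mode_score = statistics.mode(scores)      # the most frequently observed score

print(mean_score, median_score, mode_score)

# Swap the 13 for a 19: the mean shifts, the median and mode do not.
changed = [2, 3, 3, 3, 12, 19]
changed_mean = statistics.mean(changed)
changed_median = statistics.median(changed)
changed_mode = statistics.mode(changed)

print(changed_mean, changed_median, changed_mode)
```

The first set of scores gives a mean of 6, a median of 3, and a mode of 3. With 19 in place of 13 the mean becomes 7 while the median and mode stay at 3, which is why the mean is said to be sensitive to extreme values.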
So far, when we looked at the shape of a distribution, we identified the mode as the value where the distribution has a peak. We saw examples of distributions with one mode, that is, a unimodal distribution, and with two modes, a bimodal distribution. In other words, so far we've identified the mode visually from the histogram. Looking at our histograms again, we can easily see the mode: it's the most commonly occurring value in the distribution. The median, the midpoint of the distribution, is the number such that half of the observations fall above it and half fall below. We find the median by ordering the data from smallest to largest and then considering whether the number of observations, n, is odd or even. If n is odd, the median is the center observation in the ordered list. If n is even, the median is the mean of the two center observations. The mean, of course, is calculated by adding up the values of all the observations and dividing by the number of observations. Our goal here is to describe the distribution. How would you describe these two distributions of exam scores? Both are centered at 70, with a mean of approximately 70, but the distributions are really quite different. The first has much larger variability in scores than the second. To describe a distribution, we need to supplement the graphical display not only with a measure of center, but also with a measure of the variability, or spread, of the distribution.
>> There are several ways to describe spread. A commonly used measure is the standard deviation. The idea behind the standard deviation is to quantify the spread of the distribution by measuring how far the observations are from their mean. The standard deviation gives the average or typical distance between a data point and the mean.
To better understand the standard deviation, it's useful to see an example of how it's calculated. In practice, of course, the software will do these calculations for us.
>> Emergency medical services would like to estimate how many ambulance crews to keep on standby. Here are the numbers of ambulance calls per hour over an 8-hour period. To find the standard deviation of the number of hourly calls, we first find the mean of our data. Next, we find the deviations from the mean, that is, the difference between each observation and the mean. Since our mean is 9, we subtract 9 from each of our observations. As a third step, we square each of these deviations. Next, we average the squared deviations by adding them up and dividing by n minus 1, that is, one less than the sample size. This average of the squared deviations is called the variance. The standard deviation of the variable is the square root of the variance. So why do we take the square root? Note that the variance, 16 in this example, is the average of the squared deviations and therefore has a different unit of measurement. In this case, 16 is measured in squared ambulance calls, which has no natural interpretation. We therefore take the square root to compensate for having squared all of the deviations and to return to the original unit of measurement. Recall that the average number of emergency calls in an hour is 9. The interpretation of a standard deviation of 4 is that, on average, the actual number of emergency calls each hour is about 4 away from 9. Another way of saying this is that there's an average of 9 ambulance calls each hour, plus or minus 4. Since we're working with very large numbers of observations, hand calculation of the standard deviation really isn't feasible. SAS will do all these calculations for you.
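The steps just described can be sketched in a few lines of Python. The hourly ambulance counts are shown on screen and are not reproduced in this transcript, so this sketch reuses the six darts scores from earlier in the lecture; the variable names are illustrative, and this is a hand calculation for understanding, not the course's SAS workflow.

```python
import math
import statistics

# Darts scores from earlier in the lecture (stand-in for the ambulance data).
scores = [2, 3, 3, 3, 12, 13]
n = len(scores)

# Step 1: the mean.
mean_score = sum(scores) / n

# Step 2: deviations from the mean (observation minus mean).
deviations = [x - mean_score for x in scores]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: add them up and divide by n - 1 to get the sample variance.
variance = sum(squared_deviations) / (n - 1)

# Step 5: take the square root to return to the original units.
std_dev = math.sqrt(variance)

print(f"variance = {variance:.1f}, standard deviation = {std_dev:.2f}")

# Cross-check against the standard library's sample standard deviation.
library_std_dev = statistics.stdev(scores)
```

For these scores the squared deviations sum to 128, so the variance is 128 / 5 = 25.6 and the standard deviation is about 5.06. The hand calculation agrees with `statistics.stdev`, which also divides by n minus 1.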
But it's important to know how the standard deviation is calculated so that you can make sense of your variability. For example, looking at a variable's distribution in two different samples, you should be able to tell which has greater variability, that is, a larger standard deviation. To calculate the standard deviation using SAS, we call the procedure, or proc, univariate. PROC UNIVARIATE is followed by a semicolon, then the VAR statement and a list of the quantitative variables you'd like to examine. We're going to run PROC UNIVARIATE with NUMCIGMO_EST as the quantitative variable. The statement ends with a semicolon. When we run the PROC UNIVARIATE syntax, SAS provides us with tables of univariate statistics for the number of cigarettes smoked per month variable. Among others, you can see the mean, the median, and the mode, as well as the standard deviation, the variance, and the range. When we scroll down, we can also see a table that shows the cut points for specific percentiles of this variable, a table of extreme values, that is, the highest and lowest, and a missing values count. So you can see that PROC UNIVARIATE is extremely useful for better understanding important characteristics of the cigarettes smoked per month variable. We now know that the young adult smokers in our sample smoke, on average, 320 cigarettes per month, that the median amount smoked is 300 cigarettes per month, and that the mode, the most common number of cigarettes smoked per month, is 600. Given that the standard deviation is about 274, we can say that on average young adult smokers smoke 320 cigarettes per month, plus or minus 274. So as you can see, there's an extremely large range in cigarettes smoked and a lot of variability in this variable. But why didn't we add the nicotine dependence variable to the univariate syntax?
It's very important to remember that most univariate statistics are not appropriate for categorical variables, particularly those represented with dummy codes. If you'll recall, the nicotine dependence variable is represented with dummy codes: a yes is indicated with a 1 and a no with a 0. If we were to include the categorical variable for nicotine dependence, TAB12MDX, in the univariate syntax, SAS would still generate univariate tables. However, the statistics wouldn't make any sense. As you can see, we get a mean and a standard deviation based on dummy codes. Further, the percentiles listed represent yeses and nos rather than actual quantities. So again, it's very important to remember to use the appropriate descriptive statistics for quantitative and for categorical variables. Categorical variables can often be described well with frequency tables, generated by PROC FREQ, or with a bar chart. For quantitative variables, it's best to examine histograms and then supplement these with exact measures of shape, center, and spread. [MUSIC]
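To illustrate the kind of summary that suits a dummy-coded variable, here is a small Python sketch of a frequency table. The ten responses are hypothetical values made up for illustration, not data from the lecture's sample, and the names are assumptions. In the course itself this table would come from PROC FREQ.

```python
from collections import Counter

# Hypothetical dummy-coded responses (1 = yes, 0 = no); NOT real survey data.
dependence = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

counts = Counter(dependence)      # how many times each code occurs
total = len(dependence)
labels = {0: "no", 1: "yes"}

# Report each category's count and percentage, as a frequency table would.
for code in (0, 1):
    percent = 100 * counts[code] / total
    print(f"{labels[code]:>3}: {counts[code]:2d} ({percent:.0f}%)")
```

Counts and percentages per category answer the question a reader actually has about a yes/no variable, whereas percentiles or a standard deviation of the 0/1 codes do not describe any real quantity.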