The first question is: what type is the response variable? Is it categorical or quantitative? For our sample research question, the response, or dependent, variable is nicotine dependence, which is categorical. Next, we need to determine how many categories are in this response variable. Since nicotine dependence is coded 1 for yes or present, and 0 for no or absent, we have two categories in the response variable. The next question to ask is: what type is the explanatory variable? The explanatory, or independent, variable is number of cigarettes smoked per month. As we saw in the demonstration of histograms, this is a quantitative variable. It won't be visually meaningful to examine a bar chart with a quantitative explanatory variable on the X axis when our response variable is categorical. So before we start to graph, it's important to bin our explanatory variable into categories. That is, in order to visualize the relationship that we're interested in, we need to add some data management that will allow us to construct a C to C, or categorical to categorical, bar chart. To convert a quantitative variable into a categorical variable, we begin by looking at the frequency table for the explanatory variable, number of cigarettes smoked per month. We could use the cumulative percent column to make decisions about grouping individuals into quartiles, four groups of roughly equal size, or even quintiles, five groups of roughly equal size. However, in this case, it seems a better decision might be to create more meaningful smoking groups based on specific quantities. Cigarette packs contain 20 cigarettes each. We're going to create a new variable that estimates the number of packs that each individual smokes per month, rather than the number of cigarettes. This is a step closer to a meaningful categorical variable.
Returning to the program, within the DATA step, where all data management is conducted, we add the following syntax: PACKSPERMONTH=NUMCIGMO_EST/20;. The new variable is PACKSPERMONTH, and it's set equal to the number of cigarettes smoked per month, divided by 20, followed by a semicolon. Then we add this new variable, PACKSPERMONTH, to the TABLES statement, so we can view its frequency distribution. PACKSPERMONTH is still a quantitative variable, but now we can more easily create groups based on number of packs smoked in a month. After examining the frequency distribution, we decide to create groupings that include those who've smoked one through five packs per month, six through ten packs per month, 11 through 20, 21 through 30, and then more than 30 packs per month. To do this, we need to add the following syntax to our program. Here we tell SAS: if PACKSPERMONTH is less than or equal to five, then PACKCATEGORY, our new variable, is equal to three. We chose three because it's roughly the quantitative midpoint for this category. Else if PACKSPERMONTH is less than or equal to ten, then PACKCATEGORY equals seven, again roughly the midpoint. Else if PACKSPERMONTH is less than or equal to 20, then PACKCATEGORY equals 15. Else if PACKSPERMONTH is less than or equal to 30, then PACKCATEGORY equals 25. And finally, else if PACKSPERMONTH is greater than 30, then PACKCATEGORY equals 58. Again, if we examine the frequency distribution, 58 is roughly the quantitative midpoint. As always, each of these statements ends with a semicolon. When we add this new variable, PACKCATEGORY, to the TABLES statement and then run the program, we can examine the frequency distribution for the new variable.
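The data management just described can be put together as a minimal sketch like the one below. The data set name packs_demo and the six NUMCIGMO_EST values are invented so the example runs on its own; they are not the course data. The first IF test, which keeps missing values out of the lowest group, is an addition that the narration does not mention.

```sas
DATA packs_demo;
    /* Toy values of cigarettes per month, invented for illustration */
    INPUT NUMCIGMO_EST;

    /* Convert cigarettes per month to packs per month (20 per pack) */
    PACKSPERMONTH = NUMCIGMO_EST / 20;

    /* Bin packs per month; each code is the rough midpoint of its group. */
    /* The first test is an added guard: SAS treats a missing value as    */
    /* smaller than any number, so without it a missing PACKSPERMONTH     */
    /* would satisfy LE 5 and land in the lowest group.                   */
    IF PACKSPERMONTH = . THEN PACKCATEGORY = .;
    ELSE IF PACKSPERMONTH LE 5 THEN PACKCATEGORY = 3;
    ELSE IF PACKSPERMONTH LE 10 THEN PACKCATEGORY = 7;
    ELSE IF PACKSPERMONTH LE 20 THEN PACKCATEGORY = 15;
    ELSE IF PACKSPERMONTH LE 30 THEN PACKCATEGORY = 25;
    ELSE IF PACKSPERMONTH GT 30 THEN PACKCATEGORY = 58;
    DATALINES;
60
160
300
600
900
.
;
RUN;

/* Frequency distributions for the converted and the binned variable */
PROC FREQ DATA=packs_demo;
    TABLES PACKSPERMONTH PACKCATEGORY;
RUN;
```

With these toy rows, 60 cigarettes a month becomes 3 packs and falls in category 3, 160 becomes 8 packs and falls in category 7, and so on up to 900, which becomes 45 packs and falls in category 58. The row with a missing value stays missing in both new variables.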
With this new categorical variable representing packs of cigarettes smoked per month, we've retained as much of the quantitative character of the original variable as we could manage, while also ensuring the graph will be interpretable now that the explanatory variable is categorical. Back to our graphing decisions flow chart. Now that we've collapsed our quantitative explanatory variable into categories, we're ready to make our C to C, or categorical to categorical, bar chart. When graphing the relationship between a categorical explanatory variable and a categorical response variable, we use the code PROC GCHART; VBAR categoricalexplanatoryvariable/discrete TYPE=mean SUMVAR=categoricalresponsevariable, and end it with a semicolon. So the exact code we'll use for the program is PROC GCHART; VBAR PACKCATEGORY/discrete TYPE=mean SUMVAR=TAB12MDX;. Just like the PROC GCHART code in univariate graphing, VBAR requests a vertical bar chart. The categorical explanatory variable, PACKCATEGORY/discrete, tells SAS that we want the levels of our categorical explanatory variable to be represented on the X axis. The rest of the code in this statement tells SAS how the response variable should be represented on the Y axis. Specifically, TYPE=mean requests a calculated average, and SUMVAR is short for summary variable. So we're asking for the response variable, TAB12MDX, to be displayed as a mean on the Y axis. Here's our categorical by categorical bar chart. PACKCATEGORY, our explanatory variable, is on the X axis, and the rate or proportion of nicotine dependence is on the Y axis. So you can see from this graph that among those smoking one to five packs a month, about 25% are nicotine dependent. Among those smoking six to ten packs a month, 50% are nicotine dependent. Among those smoking 11 to 20 packs a month, 58% are nicotine dependent. Among those smoking 21 to 30 packs a month, almost 70% are nicotine dependent.
And among those smoking more than 30 packs a month, around 77% are nicotine dependent. We can also see that these rates form a pattern. That is, the more packs smoked a month, the higher the rate of nicotine dependence. So in a graphical way, we're already seeing that there seems to be a relationship between smoking and nicotine dependence, as we hypothesized. Looking at our graphing decision chart, we can see the steps we've taken to generate a bivariate graph with a categorical response variable that has two categories, and a quantitative explanatory variable. We also discussed how to convert the quantitative explanatory variable to a categorical variable, a step which must be taken for the purposes of visualizing the relationship. If our explanatory variable had originally been categorical rather than quantitative, we could have skipped this step and just moved on to a categorical by categorical bar chart. What decisions need to be made if the response variable has more than two categories? In this case, we would need to collapse the response variable categories into two categories. To demonstrate this, we'll have to modify the research question. So let's modify the research question to look at the association between ethnicity and smoking stage. And we'll create a response variable that categorizes young adult smokers into three groups: non-daily smokers, daily smokers, and those with nicotine dependence. These are the ethnic groups recorded in the NESARC codebook, along with the syntax that we can use to create a three-category smoking stage variable. This sample can be described with these three smoking categories. This univariate bar chart shows that about 50% of the young adults sampled are nicotine dependent. About 30% are daily smokers without nicotine dependence. And almost 17% are non-daily smokers.
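The syntax for the three-category smoking stage variable is shown on screen rather than read aloud, so what follows is only a hedged sketch of one way such a variable could be coded. The variable name SMOKEGRP, the 1/2/3 codes, and the toy rows are all assumptions. The sketch relies only on what the narration states: TAB12MDX is 1 for nicotine dependence, and S3AQ3B1 equal to 1 marks a daily smoker.

```sas
DATA stage_demo;
    /* Toy rows, invented for illustration: dependence flag, smoking frequency */
    INPUT TAB12MDX S3AQ3B1;

    /* SMOKEGRP and its codes are hypothetical, not the lecture's own syntax */
    IF TAB12MDX = 1 THEN SMOKEGRP = 1;                        /* nicotine dependent        */
    ELSE IF TAB12MDX = 0 AND S3AQ3B1 = 1 THEN SMOKEGRP = 2;   /* daily, not dependent      */
    ELSE IF TAB12MDX = 0 AND S3AQ3B1 NE . THEN SMOKEGRP = 3;  /* non-daily, not dependent  */
    /* Rows matching none of the tests keep SMOKEGRP missing */
    DATALINES;
1 1
1 4
0 1
0 3
0 .
;
RUN;

/* Frequency distribution of the three-category variable */
PROC FREQ DATA=stage_demo;
    TABLES SMOKEGRP;
RUN;
```

In these toy rows, the last observation has a missing smoking frequency and no dependence, so it is left out of all three groups rather than being counted as a non-daily smoker.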
However, to examine a relationship between this variable, as the response variable, and another, we need to collapse it to only two categories. To do this, we need to make some decisions. Here are two perfectly reasonable decisions that we could make. We could examine the association between ethnicity and daily versus non-daily smokers. Or we could examine the association between ethnicity and nicotine-dependent versus non-nicotine-dependent individuals, thereby collapsing across these categories in some way. In either case, some data management needs to be added to the program. To collapse the response variable into daily versus non-daily smokers, we use this syntax. IF S3AQ3B1=1, that is, if the individual smokes 30 days a month, THEN DAILY=1, followed by a semicolon. ELSE IF S3AQ3B1 is not equal to dot, that is, it's not equal to missing, THEN DAILY=0, again followed by a semicolon. To graph the relationship between a categorical explanatory variable and a categorical response variable, we use the same code that graphs a categorical explanatory variable against a quantitative response variable, here applied to a response variable that has been binned into two categories. PROC GCHART; VBAR ETHRACE2A, which is our categorical explanatory variable, forward slash, discrete, TYPE=mean, SUMVAR=DAILY, our categorical response variable, followed by a semicolon. Remember, a categorical response variable should not have more than two categories or levels, and those two categories should be coded as 0 and 1. 0 represents no, or negative observations, and 1 represents yes, or positive observations. In this format, requesting the mean of a categorical response variable actually gives us the proportion of ones, or positive observations. >> Because our response variable was categorical with more than two categories, we needed to collapse it into only two categories.
And because our explanatory variable, ethnicity, was categorical, we created a categorical by categorical bar chart. Had our explanatory variable been quantitative, we would have needed to bin or collapse that variable into categories before creating the categorical by categorical bar chart.
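As a recap, here is a minimal sketch of the collapse to a 0/1 response followed by the categorical by categorical bar chart. The data set name daily_demo and its rows are invented for illustration, and the PROC MEANS step is an addition, included only to show that the mean of a 0/1 variable is the proportion of ones. PROC GCHART assumes SAS/GRAPH is available.

```sas
DATA daily_demo;
    /* Toy rows, invented for illustration: ethnicity code, smoking frequency */
    INPUT ETHRACE2A S3AQ3B1;

    /* 1 = daily smoker; any other nonmissing answer = non-daily smoker */
    IF S3AQ3B1 = 1 THEN DAILY = 1;
    ELSE IF S3AQ3B1 NE . THEN DAILY = 0;
    DATALINES;
1 1
1 1
1 3
1 6
2 1
2 4
2 .
5 1
5 1
5 2
;
RUN;

/* The mean of a 0/1 variable is the proportion of 1s within each group */
PROC MEANS DATA=daily_demo MEAN N;
    CLASS ETHRACE2A;
    VAR DAILY;
RUN;

/* C to C bar chart: ethnicity on the X axis, proportion daily on the Y axis */
PROC GCHART DATA=daily_demo;
    VBAR ETHRACE2A / DISCRETE TYPE=MEAN SUMVAR=DAILY;
RUN;
QUIT;
```

In these toy rows, group 1 has two daily smokers out of four, so its mean is 0.5. Group 2 has one out of two, because the row with a missing frequency leaves DAILY missing and is excluded from the mean. The bar heights in the chart match the means that PROC MEANS reports.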