In this section, Examining Relationships, we will look at two variables at a time and, as the title suggests, explore the relationship between them using visual displays and numerical summaries. Here are a few examples of such research questions with the two variables highlighted:

**Examples:**

Is there a relationship between gender and test scores on a particular standardized test?

Is performance on the test related to gender?

Is there a gender effect on test scores?

Are there differences in test scores between males and females?

How is the number of calories in a hot dog related to (or affected by) the type of hot dog (beef, meat or poultry)? In other words, are there differences in the number of calories among the three types of hot dogs?

Is there a relationship between the type of light a baby sleeps with (no light, night-light, lamp) and whether or not the child develops nearsightedness?

Are the smoking habits of a person (yes, no) related to the person's gender?

How well can we predict a student's freshman year GPA from his/her SAT score?

What is the relationship between driver's age and sign legibility distance (the maximum distance at which the driver can read a sign)?

Is there a relationship between the time a person has practiced driving while having a learner's permit, and whether or not this person passed the driving test?

Can you predict a person's favorite type of music (classical, rock, jazz) based on his/her IQ level?

In most studies involving two variables, each of the variables has a role. We distinguish between:

**the explanatory variable (also commonly referred to as the independent variable)** - the variable that claims to explain, predict or affect the response; and

**the response variable (also commonly referred to as the dependent variable)** - the outcome of the study.

Typically the **explanatory (or independent) variable is denoted by X**, while the response (or dependent) variable is **denoted by Y**.

In the test score example, we want to explore whether the outcome of the study (the score on a test) is affected by the test-taker's gender. Therefore, **gender is the explanatory variable and test score is the response variable**.

In the nearsightedness study, we explore whether the nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore, light type is the explanatory variable and nearsightedness is the response variable.

In the GPA example, we are examining whether a student's SAT score is a good predictor of the student's freshman year GPA. Therefore, SAT score is the explanatory variable and freshman year GPA is the response variable.

In the driving test example, we are examining whether a person's outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore, time is the explanatory variable and driving test outcome is the response variable.

**Example Research Questions with Two Variables**

Example 1: Is there a relationship between gender (explanatory) and test scores (response) on a particular standardized test? OR Is performance on the test related to gender? OR Is there a gender effect on test scores? OR Are there differences in test scores between males and females?

Example 2: How is the number of calories (response) in a hot dog related to (or affected by) the type of hot dog (explanatory) (beef, meat or poultry)? In other words, are there differences in the number of calories among the three types of hot dogs?

Example 3: Is there a relationship between the type of light a baby sleeps with (explanatory) (no light, night-light, lamp) and whether or not the child develops nearsightedness (response)?

Example 4: Are the smoking habits of a person (response) (yes, no) related to the person's gender (explanatory)?

Example 5: How well can we predict a student's freshman year GPA (response) from his/her SAT score (explanatory)?

Example 6: What is the relationship between driver's age (explanatory) and sign legibility distance (the maximum distance at which the driver can read a sign) (response)?

Example 7: Is there a relationship between the time a person has practiced driving while having a learner's permit (explanatory), and whether or not this person passed the driving test (response)?

Example 8: Can you predict a person's favorite type of music (response) (classical, rock, jazz) based on his/her IQ level (explanatory)?

**There are studies in which the role classification is not really clear**. This mainly happens in cases when both variables are categorical or both are quantitative. An example is a study that explores the relationship between students' SAT Math and SAT Verbal scores. In cases like this, any classification choice would be fine (as long as it is consistent throughout the analysis).

If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the following 4 possibilities for "role-type classification":

Categorical explanatory and quantitative response

Categorical explanatory and categorical response

Quantitative explanatory and quantitative response

Quantitative explanatory and categorical response

This role-type classification can be summarized and easily visualized in the following table (note that the explanatory variable is always listed first). In each of the 4 cases, different statistical tools (displays and numerical measures) should be used in order to explore the relationship between the two variables.

When confronted with a research question that involves exploring the relationship between two variables, the first and most crucial step is to determine which of the 4 cases represents the data structure of the problem. In other words, the first step should be classifying the two relevant variables according to their role and type, and only then can we determine what statistical tools should be used to analyze them.

**Example 1:** Gender is the **explanatory variable and it is categorical**. Test score is the **response variable and it is quantitative**. Therefore this is an example of case C→Q.

**Example 3:** Light Type is the **explanatory variable and it is categorical**. Nearsightedness is the **response variable and it is categorical**. Therefore this is an example of case C→C.

**Example 5:** SAT Score is the **explanatory variable and it is quantitative**. GPA of Freshman Year is the **response variable and it is quantitative**. Therefore this is an example of case Q→Q.

**Example 7:** Time is the **explanatory variable and it is quantitative**. Driving Test Outcome is the **response variable and it is categorical**. Therefore this is an example of case Q→C.
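The classification step can be written as a small lookup. The following is only an illustrative sketch: the function name `role_type_case` and the dictionary layout are assumptions made for this example, not part of the course material. It takes the type of the explanatory variable first and the type of the response second, mirroring the convention that the explanatory variable is always listed first.

```python
# Map a variable's type to the one-letter code used in the case labels.
TYPE_CODES = {"categorical": "C", "quantitative": "Q"}


def role_type_case(explanatory_type: str, response_type: str) -> str:
    """Return the case label, explanatory variable first (e.g. 'C→Q')."""
    return f"{TYPE_CODES[explanatory_type]}→{TYPE_CODES[response_type]}"


# Example number -> (explanatory variable, its type, response variable, its type)
examples = {
    1: ("gender", "categorical", "test score", "quantitative"),
    3: ("light type", "categorical", "nearsightedness", "categorical"),
    5: ("SAT score", "quantitative", "freshman year GPA", "quantitative"),
    7: ("practice time", "quantitative", "driving test outcome", "categorical"),
}

# Classify each example by the types of its two variables.
cases = {
    number: role_type_case(x_type, y_type)
    for number, (_, x_type, _, y_type) in examples.items()
}

for number, case in cases.items():
    print(f"Example {number}: {case}")
```

Only the two types enter the lookup; deciding which variable plays which role still has to come from the research question itself.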

**C→Q:** Is there a relationship between gender (categorical) and test scores on a particular standardized test (quantitative)?

**C→Q:** How is the number of calories (quantitative) in a hot dog related to (or affected by) the type of hot dog (beef, meat or poultry)? In other words, are there differences in the number of calories among the three types of hot dogs (categorical)?

**C→C:** Is there a relationship between the type of light a baby sleeps with (no light, night-light, lamp) (categorical) and whether or not the child develops nearsightedness (categorical)?

**C→C:** Are the smoking habits of a person (yes, no) (categorical) related to the person's gender (categorical)?

**Q→Q:** How well can we predict a student's freshman year GPA (quantitative) from his/her SAT score (quantitative)?

**Q→Q:** What is the relationship between driver's age (quantitative) and sign legibility distance (the maximum distance at which the driver can read a sign) (quantitative)?

**Q→C:** Is there a relationship between the time a person has practiced driving (quantitative) while having a learner's permit, and whether or not this person passed the driving test (categorical)?

**Q→C:** Can you predict a person's favorite type of music (classical, rock, jazz) (categorical) based on his/her IQ level (quantitative)?

A store asked 250 of its customers whether or not they were satisfied with the service. The purpose of this study was to examine the relationship between the customer's satisfaction and gender. Both the explanatory (gender) and response (satisfaction) variables are categorical in this case. Therefore, this is an example of case C→C.

A study was conducted in order to explore the relationship between the number of beers a person drinks, and his/her Blood Alcohol Content (BAC, in %). Both the explanatory (number of beers) and response (BAC) variables are quantitative in this case, and therefore this is an example of case Q→Q.

A study was conducted in order to determine whether longevity (how long a person lives) is related to a person's handedness (right-handed/left-handed). In this case the explanatory variable (handedness) is categorical and the response variable (longevity) is quantitative. Therefore, this is an example of case C→Q.

People who are concerned about their health may prefer hot dogs that are low in calories. A study was conducted by a concerned health group in which 54 major hot dog brands were examined, and their calorie contents recorded. In addition, each brand was classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat).

**The purpose of the study was to examine whether the number of calories a hot dog has is related to (or affected by) its type.**

Answering this question requires us to examine the relationship between the categorical variable, Type, and the quantitative variable, Calories. Because the question of interest is whether the type of hot dog affects calorie content, the **explanatory variable is Type**, and the **response variable is Calories.**

Here is what the raw data look like:

The raw data are a list of types and calorie contents, and are not very useful in that form. To explore how the **number of calories is related to the type of hot dog, we need an informative visual display of the data** that will compare the three types of hot dogs with respect to their calorie content.

The visual display that we'll use is **side-by-side boxplots**. The side-by-side boxplots will allow us to compare the distribution of calorie counts within each category of the explanatory variable, hot dog type.

As before, we supplement the **side-by-side boxplots with the descriptive statistics of the calorie content (response)** for each type of hot dog separately (i.e., for each level of the explanatory variable separately).

Let's summarize the results we got and interpret them in the context of the question we posed:

By examining the three side-by-side boxplots and the numerical summaries, **we see at once that poultry hot dogs, as a group, contain fewer calories than those made of beef or meat**. The median number of calories in poultry hot dogs (113) is less than the median (and even the first quartile) of either of the other two distributions (medians 152.5 and 153).

The spread of the three distributions is about the same, if IQR is considered (all slightly above 40), but the (full) ranges vary slightly more **(beef: 79, meat: 88, poultry: 66)**.

The general recommendation to the health-conscious consumer is to eat poultry hot dogs. It should be noted, though, that since each of the three types of hot dogs shows quite a large spread among brands, **simply buying a poultry hot dog does not guarantee a low-calorie food.**

What we learn from this example is that when exploring the relationship between a categorical explanatory variable and a quantitative response (Case C→Q), we essentially **compare the distributions of the quantitative response for each category of the explanatory variable using side-by-side boxplots supplemented by descriptive statistics**.

Statistic | Beef | Meat | Poultry |
---|---|---|---|
Min | 111 | 107 | 86 |
Q1 | 139.5 | 138.5 | 100.5 |
Median | 152.5 | 153 | 113 |
Q3 | 179.75 | 180.5 | 142.5 |
Max | 190 | 195 | 152 |
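The two measures of spread used in the comparison can be recomputed directly from the five-number summaries in the table. This is a minimal sketch: the numbers are copied from the table, while the helper name `spread` and the dictionary layout are assumptions introduced for illustration.

```python
# Five-number summaries copied from the table: (min, Q1, median, Q3, max).
summaries = {
    "Beef": (111, 139.5, 152.5, 179.75, 190),
    "Meat": (107, 138.5, 153, 180.5, 195),
    "Poultry": (86, 100.5, 113, 142.5, 152),
}


def spread(five_numbers):
    """Return (IQR, range) for one five-number summary."""
    minimum, q1, _median, q3, maximum = five_numbers
    return q3 - q1, maximum - minimum


# Compute both measures of spread for each hot dog type.
spreads = {hot_dog_type: spread(s) for hot_dog_type, s in summaries.items()}

for hot_dog_type, (iqr, full_range) in spreads.items():
    print(f"{hot_dog_type}: IQR = {iqr}, range = {full_range}")
```

The IQR depends only on the middle half of each distribution, which is why it can be nearly equal across the three types even though the full ranges differ.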

Earlier in the course (when we discussed the distribution of a single categorical variable), we examined the data obtained when a random sample of 1,200 U.S. college students were asked about their body image (underweight, overweight, or about right).

If we had separated our sample of 1,200 U.S. college students by gender and looked at males and females separately, would we have found a similar distribution across body-image categories? More specifically, are men and women just as likely to think their weight is about right? Among those students who do not think their weight is about right, is there a difference between the genders in feelings about body image?

Answering these questions requires us to examine the relationship between two categorical variables, gender and body image. Because the question of interest is whether there is a gender effect on body image, the explanatory variable is gender, and the response variable is body image.

Here is what the raw data look like when we include the gender of each student:

To start our exploration of how body image is related to gender, we need an informative display that summarizes the data. In order to summarize the relationship between two categorical variables, **we create a display called a two-way table**.

The table has the possible genders in the rows, and the possible responses regarding body image in the columns. At each intersection between row and column, we put the counts for how many times that combination of gender and body image occurred in the data. We sum across the rows to fill in the Total column, and we sum across the columns to fill in the Total row.

How many females in our sample feel that they are underweight? 37. Since we want those students who are both female and responded underweight, we are looking for the intersection between the female row and the underweight column.

How many males in our sample feel that they are about right? 295. Since we want the number who are both male and responded about right, we are looking for the intersection between the male row and the about right column.

What is the total number of females in our sample? The number in the total column that is in the female row is 760.

What is the total number of students who feel they are overweight? The total of the overweight column is 235.

Note that from the way the two-way table is constructed, the Total row or column is a summary of one of the two categorical variables, ignoring the other. The Total row gives the summary of the categorical variable body image. The Total column gives the summary of the categorical variable gender.

Remember, though, that our primary goal is to explore how body image is related to gender. Exploring the relationship between two categorical variables (in this case, body image and gender) amounts to comparing the distributions of the response variable (in this case, body image) across the different values of the explanatory variable (in this case, males and females).

Note that it doesn't make sense to compare raw counts, because there are more females than males overall. So, for example, it is not very informative to say, "There are 560 females who responded 'about right' compared to only 295 males," since the 560 females are out of a total of 760, and the 295 males are out of a total of only 440.

We need to supplement our display, the two-way table, with some numerical summaries that will allow us to compare the distributions. These numerical summaries are found by simply converting the counts to percentages within (or restricted to) each value of the explanatory variable separately.

In our example, we look at each gender separately and convert the counts to percentages within that gender. Let's start with females:

Note that each count is converted to a percentage by dividing by the total number of females, 760. These numerical summaries are called conditional percentages, since we find them by "conditioning" on one of the genders.

In our example, we chose to organize the data with the explanatory variable gender in rows and the response variable body image in columns, and thus our conditional percentages were row percentages, calculated within each row separately.
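As a rough sketch of the row-percentage calculation: the text quotes only some of the cell counts (560, 37 and 295) along with the totals 760, 440 and 235, so the three remaining cells below are derived by subtraction from those totals rather than read from the original table. The variable names and the helper `row_percentages` are assumptions made for this illustration.

```python
# Counts stated in the text.
female_total, male_total = 760, 440
overweight_total = 235
female = {"about right": 560, "underweight": 37}
male = {"about right": 295}

# Remaining cells, derived by subtraction from the stated totals.
female["overweight"] = female_total - sum(female.values())
male["overweight"] = overweight_total - female["overweight"]
male["underweight"] = male_total - sum(male.values())

# Explanatory variable (gender) in rows, response (body image) in columns.
table = {"female": female, "male": male}


def row_percentages(counts):
    """Convert one row of counts to percentages of that row's total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Conditional percentages: condition on each gender separately.
conditional = {gender: row_percentages(row) for gender, row in table.items()}

for gender, row in conditional.items():
    cells = ", ".join(f"{cat}: {pct:.1f}%" for cat, pct in sorted(row.items()))
    print(f"{gender}: {cells}")
```

With these counts, about 73.7% of the females and about 67.0% of the males responded "about right", a comparison that the raw counts 560 and 295 do not show directly.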

**Similarly, if the explanatory variable happens to sit in columns and the response variable in rows, our conditional percentages will be column percentages**, calculated within each column separately. For an example, see the "Did I Get This?" exercises below.

Another way to visualize the conditional percentages, instead of in a table, **is to use a double bar chart. This display is quite common in newspapers.**

Case Q→Q is different in the sense that both variables (in particular the explanatory variable) are quantitative, and therefore, as you'll discover, this case will require a different kind of treatment and tools.

A Pennsylvania research firm conducted a study in which 30 drivers (of ages 18 to 82 years old) were sampled, and for each one, the maximum distance (in feet) at which he/she could read a newly designed sign was determined. The goal of this study was to explore the relationship between a driver's age and the maximum distance at which signs were legible, and then use the study's findings to improve safety for older drivers.

Since the purpose of this study is to explore the effect of age on maximum legibility distance, **the explanatory variable is Age, and the response variable is Distance.**

Note that the data structure is such that for each individual (in this case driver 1, ..., driver 30) we have a pair of values (in this case representing the driver's age and distance). We can therefore think about these data as 30 pairs of values: (18, 510), (32, 410), (55, 420), ..., (82, 360).

The first step in exploring the relationship between driver age and sign legibility distance is to create an appropriate and informative graphical display.

**The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot. Here is how a scatterplot is constructed for our example:**

To create a scatterplot, each pair of values is plotted, so that the **value of the explanatory variable (X) is plotted on the horizontal axis, and the value of the response variable (Y) is plotted on the vertical axis**. In other words, each individual (driver, in our example) appears on the scatterplot as a single point whose X-coordinate is the value of the explanatory variable for that individual, and whose Y-coordinate is the value of the response variable.

It is important to mention again that when creating a scatterplot, the explanatory variable should always be plotted on the horizontal X-axis, and the response variable should be plotted on the vertical Y-axis. If in a specific example we do not have a clear distinction between explanatory and response variables, each of the variables can be plotted on either axis.
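The construction can be sketched in code as follows. Only the four (age, distance) pairs quoted in the text are used, since the remaining 26 are not listed there. The function `save_scatterplot`, its default file name, and the choice of the third-party matplotlib library are assumptions for illustration; the course itself does not prescribe a plotting tool.

```python
# The four (age, distance) pairs quoted in the text; the rest are elided there.
pairs = [(18, 510), (32, 410), (55, 420), (82, 360)]

# Explanatory variable (Age) supplies X; response variable (Distance) supplies Y.
ages = [age for age, _ in pairs]
distances = [distance for _, distance in pairs]


def save_scatterplot(xs, ys, path="age_vs_distance.png"):
    """Draw a scatterplot with matplotlib (third-party) and save it to `path`."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)  # one point per driver
    ax.set_xlabel("Driver age (years)")  # explanatory on the horizontal axis
    ax.set_ylabel("Sign legibility distance (feet)")  # response on the vertical axis
    fig.savefig(path)
    plt.close(fig)
```

Calling `save_scatterplot(ages, distances)` would write the image file if matplotlib is installed. Passing the two lists in the other order would swap the axes, which is exactly what the rule about explanatory and response variables is meant to prevent.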

As the figure explains, when describing the overall pattern of the relationship we look at its direction, form and strength. The direction of the relationship can be positive, negative, or neither. A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other. A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other. Not all relationships can be classified as either positive or negative.

The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. Relationships with a linear form are most simply described as points scattered about a line. Relationships with a curvilinear form are most simply described as points dispersed around the same curved line.

There are many other possible forms for the relationship between two quantitative variables, but linear and curvilinear forms are quite common and easy to identify. Another form-related pattern that we should be aware of is clusters in the data.

The strength of the relationship is determined by how closely the data follow the form of the relationship. Let's look, for example, at the following two scatterplots displaying **positive, linear relationships**:

We can see that in the top scatterplot the data points follow the linear pattern quite closely. This is an example of a strong relationship. In the bottom scatterplot, the points also follow the linear pattern, but much less closely, and therefore we can say that the relationship is weaker. In general, though, assessing the strength of a relationship just by looking at the scatterplot is quite problematic, and we need a numerical measure to help us with that.

Data points that deviate from the pattern of the relationship are called outliers. **We will see several examples of outliers during this section. Two outliers are illustrated in the scatterplot below.**

Let's go back now to our example, and use the scatterplot to examine the relationship between the age of the driver and the maximum sign legibility distance. Here is the scatterplot:

**The direction of the relationship is negative**, which makes sense in context, since as you get older your eyesight weakens, and in particular older drivers tend to be able to read signs only at lesser distances. An arrow drawn over the scatterplot illustrates the negative direction of this relationship.

The form of the relationship seems to be linear. Notice how the points tend to be scattered about the line. Although, as we mentioned earlier, it is problematic to assess the strength without a numerical measure, the relationship appears to be moderately strong, as the data are fairly tightly scattered about the line.

**Finally, all the data points seem to "obey" the pattern; there do not appear to be any outliers.**