• Purpose of Statistics Package Exercises : The Probability & Statistics course focuses on the processes you use to convert data into useful information. This involves

1. Collecting data,

2. Summarizing data, and

3. Interpreting data.

• In addition to being able to apply these processes, you can learn how to use statistical software packages to help manage, summarize, and interpret data. The statistics package exercises included throughout the course give you the opportunity to explore a dataset and answer questions based on the output using R, Statcrunch, TI Calculator, Minitab, or Excel. In each exercise, you can view instructions for whichever of these packages you choose to use.

• The statistics package exercises are an extension of activities already embedded in the course and require you to use a statistics package to generate output and answer a different set of questions.

1. To download R, a free software environment for statistical computing and graphics, go to https://www.r-project.org/ and follow the instructions provided.

• Using R

1. Throughout the statistics package exercises, you will be given commands to execute in R. You can use the following steps to avoid having to type all of these commands in by hand:

2. Highlight the command with your mouse.

3. On the browser menu, click "Edit," then "Copy."

4. Click on the R command window, then at the top of the R window, click "Edit," then "Paste."

5. You may have to press Enter to execute the command.

• R Version

1. The R instructions are current through version 3.2.5 released on April 14, 2016. Instructions in these statistics package exercises may not work with newer releases of R.

2. For help with installing R for MAC OS X or Windows click here

• Background An Associated Press article captured the attention of readers with the headline "Night lights bad for kids?" The article was based on a 1999 study at the University of Pennsylvania and Children's Hospital of Philadelphia, in which parents were surveyed about the lighting conditions under which their children slept between birth and age 2 (lamp, night-light, or no light) and whether or not their children developed nearsightedness (myopia). The purpose of the study was to explore the effect of a young child's nighttime exposure to light on later nearsightedness.

• In this activity, we will use the collected data to:

1. learn how to build a two-way table and compute conditional percentages.

2. interpret the data in terms of the relationship between a young child's nighttime exposure to light and later nearsightedness.

• R Instructions :

• The data have been loaded into the data frame nightlight. Enter the command

1. nightlight

• to see the data. There are two variables: Light and Nearsightedness .

• You should see a report about 479 subjects with two columns. The first column records what kind of light, if any, was used in the subjects’ bedrooms while they slept, and the second column indicates whether or not the subjects later became nearsighted.

• Since there are only two variables in the data frame, we can obtain a summary of the data in a two-way table by applying the table() command to the data frame itself:

1. t = table(nightlight);
t

• If the data frame had more than two variables, the specific variables would have to be identified within the table() command, with the first entry giving the row values and the second entry giving the column values.

1. tt = table(nightlight$Light,nightlight$Nearsightedness);
tt

• To create a table of conditional proportions showing what proportion of children in each treatment group became nearsighted, we must divide each table entry by its row total. To do this, copy the following command to R:

1. prop.table(t,1)

• To create a table of conditional proportions showing, separately for the children who became nearsighted and for those who did not, what proportion were in each treatment group, we must divide each table entry by its column total. To do this, copy the following command to R:

1. prop.table(t,2)

• To create a table of conditional percentages, simply multiply the result of either command above by 100, like so:

1. prop.table(t,1)*100
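If you want to experiment without the course dataset, the steps above can be tried on a small made-up data frame. This is only a sketch: the data frame toy and its ten rows are hypothetical and are not the study data. It builds the two-way table from two named columns and then computes the row-conditional percentages.

```r
# Hypothetical toy data (NOT the nightlight study data): 10 made-up children
toy <- data.frame(
  Light = c("lamp", "lamp", "lamp", "lamp",
            "night light", "night light", "night light",
            "no light", "no light", "no light"),
  Nearsightedness = c("Yes", "Yes", "No", "No",
                      "Yes", "No", "No",
                      "No", "No", "Yes")
)

# Two-way table: rows = Light, columns = Nearsightedness
toy_table <- table(toy$Light, toy$Nearsightedness)
toy_table

# Row-conditional percentages: each entry divided by its row total, times 100
row_pct <- prop.table(toy_table, 1) * 100
round(row_pct, 1)

# Sanity check: each row of a row-conditional table adds up to 100
rowSums(row_pct)
```

Replacing the 1 with a 2 in prop.table() would give column-conditional percentages, in which each column (rather than each row) adds up to 100.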


Q. Compare the distribution of nearsightedness between those exposed to lamp light, a night-light, or no light at all. What do the conditional percentages derived from the data suggest about the relationship between early nighttime exposure to light and later myopia? Should parents worry about using night-lights and lamps with young children?

• Of the three groups of children, those who were exposed to no light at all were least likely to be nearsighted, with an incidence of 9.9%. Those exposed to a night-light were nearsighted 34.1% of the time, and those exposed to a lamp were nearsighted 54.7% of the time. Note that these percentages show that children who slept with a lamp were about 5.5 times as likely to develop nearsightedness as the children who slept with no light. Based upon these data alone, parents might discontinue using night-lights and lamps with young children, as their use seems to be associated with nearsightedness when the children grow up. However, there is a strong argument against this conclusion, which will be presented in a later lesson.
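The comparison between groups in the answer above is a simple ratio of conditional percentages. As a quick check, the arithmetic can be done in R using the three percentages quoted in the answer (the variable names here are made up for illustration):

```r
# Conditional percentages of nearsightedness quoted in the answer above
pct_no_light    <- 9.9
pct_night_light <- 34.1
pct_lamp        <- 54.7

# How many times as likely each group is, relative to the no-light group
lamp_vs_none        <- pct_lamp / pct_no_light
night_light_vs_none <- pct_night_light / pct_no_light

round(c(lamp = lamp_vs_none, night_light = night_light_vs_none), 1)
```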

• In this exercise, we will:

1. learn how to create a scatterplot.

2. use the scatterplot to examine the relationship between two quantitative variables.

3. learn how to create a labeled scatterplot.

4. use the labeled scatterplot to better understand the form of a relationship.

• In this activity, we look at height and weight data that were collected from 57 males and 24 females, and use the data to explore how the weight of a person is related to (or affected by) his or her height. This implies that height will be our explanatory variable and weight will be our response variable. We will then look at gender, and see how labeling this third variable contributes to our understanding of the form of the relationship.

• R Instructions

2. The data have been loaded into the data frame h . Enter the command

1. 
h

3. to see the data. There are three variables in h : gender , height , and weight .

4. The variables are identified as follows: gender : 0 = male, 1 = female. height : in inches. weight : in pounds.

5. First we will create a scatterplot to examine how weight is related to height, ignoring gender.

6. To do that in R, copy the following command to R:

1. 
plot(h$height,h$weight)

7. A good graphic should have labels, so let's add x-axis and y-axis labels:

1. 
plot(h$height,h$weight, xlab="Height (inches)", ylab="Weight (lbs)")


Q. Describe the relationship between the height and weight of the subjects suggested by the data. Consider the pattern of the data—mainly direction and form—and any deviations from this pattern, such as outliers.

• The direction of the relationship is positive. In context, this means that individuals who are taller are heavier. The form of the relationship is curvilinear. Weight seems to increase more and more rapidly with height as we consider taller individuals. We might say that the relationship is moderate in strength, because the points suggest, but do not closely follow, a curvilinear form. There do not appear to be any outliers.

• So far we have studied the relationship between height and weight for all of the males and females together. It may be interesting to examine whether the relationship between height and weight is different for males and females. To visualize the effect of the third variable, gender, we will indicate in the scatterplot which observations are males and which are females.

• R Instructions

1. To do that with R, we change the color of the data points representing females to red:

1. 
plot(h$height,h$weight, xlab="Height (inches)", ylab="Weight (lbs)")
points(h$height[h$gender==1],h$weight[h$gender==1],col="red")

2. You can make a nicer looking plot with males shown in blue, females in red, and labels telling which is which:

1. 
plot(h$height,h$weight, xlab="Height (inches)", ylab="Weight (lbs)",col="blue")
points(h$height[h$gender==1],h$weight[h$gender==1],col="red")
legend(55,225, pch=1, col=c("red","blue"),legend=c("females","males"))

3. Note: To look up more details about legend() , simply type

1. 
?legend

4. into R and press Enter to get the help information about the function.
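An alternative way to color the points, sketched below under stated assumptions, is to build one color per observation and pass that vector to a single plot() call instead of overplotting with points(). The data frame toy_h and its values are hypothetical stand-ins for h, and the "topleft" keyword is used so the legend position does not depend on the axis ranges.

```r
# Hypothetical toy data (NOT the course data frame h): gender 0 = male, 1 = female
toy_h <- data.frame(
  gender = c(0, 0, 0, 0, 1, 1, 1, 1),
  height = c(68, 70, 72, 74, 60, 62, 64, 66),
  weight = c(160, 172, 185, 200, 110, 118, 125, 133)
)

# One color per point, chosen from the gender code
point_col <- ifelse(toy_h$gender == 1, "red", "blue")

# A single plot() call draws every point in its own color
plot(toy_h$height, toy_h$weight,
     xlab = "Height (inches)", ylab = "Weight (lbs)",
     col = point_col)

# "topleft" places the legend by keyword instead of x/y coordinates
legend("topleft", pch = 1, col = c("red", "blue"),
       legend = c("females", "males"))
```

Either approach gives the same picture; the color-vector version simply avoids drawing the female points twice.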

Q. Consider how taking gender into account adds to our understanding of the overall form of the relationship. Address any interesting differences between the clusters associated with the two genders. More specifically, address the following two questions: (A) Where do the males and females fall on the scatterplot with respect to height and weight? Explain why you think this is so. (B) Does it look like the weight of females increases with an increase in height as quickly as the weight of males increases with a corresponding increase in height?

• We see that the overall form of the relationship can be thought of in terms of two clusters, one for each gender. Points corresponding to the females cluster in the lower left of the scatterplot, which means that females generally have a lower height and weight than males (due to obvious biological differences between males and females). The main difference between males and females is that the weight of females does not appear to increase as quickly with height as the weight of males does.

• In this activity we will:

1. learn how to compute the correlation (r).

2. practice interpreting the value of the correlation.

3. see an example of how including an outlier can increase the correlation.

• Recall the following example: The average gestation period, or time of pregnancy, of an animal is closely related to its longevity —the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded.

• R Instructions

2. The data have been loaded into the data frame a . Enter the command

1. 
a

3. to see the data. The variables in a are animal , gestation , and longevity .

4. animal : the name of the animal species

5. gestation : the average gestation period of the species, in days

6. longevity : the average longevity of the species, in years

8. Remember that the correlation is only an appropriate measure of the linear relationship between two quantitative variables. First produce a scatterplot to verify that gestation and longevity are nearly linear in their relationship.

• R Instructions

1. To do this in R, copy the command:

1. 
plot(a$longevity,a$gestation,xlab="Average Longevity of Species (years)", ylab="Average Gestation Period of Species (days)")


2. Observe that the relationship between gestation period and longevity is linear and positive. Now we will compute the correlation between gestation period and longevity.

• R Instructions

1. To do that in R, copy the command:

1. 
cor(a$longevity,a$gestation)

2. Now return to the scatterplot that you created earlier. Notice that there is an outlier in both longevity (40 years) and gestation (645 days). Note: This outlier corresponds to the longevity and gestation period of the elephant.

4. What do you think will happen to the correlation if we remove this outlier?

• R Instructions

1. To do this in R, copy the following command:

1. 
cor(a$longevity[a$animal!="elephant"],a$gestation[a$animal!="elephant"])

2. Notice that the correlation between gestation and longevity has changed.
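To see the same effect in isolation, here is a minimal sketch on made-up numbers (the vectors x and y are hypothetical and are not the animal data). The last point lies far from the others but in line with their upward trend, so including it increases the correlation.

```r
# Hypothetical toy data (NOT the animal data frame a); the last point is an outlier
x <- c(1, 2, 3, 4, 5, 20)
y <- c(2, 1, 4, 3, 5, 30)

# Correlation using all six points
r_with_outlier <- cor(x, y)

# Drop the outlier with a logical index, as in the command above
keep <- x != 20
r_without_outlier <- cor(x[keep], y[keep])

c(with_outlier = r_with_outlier, without_outlier = r_without_outlier)
```

An outlier that falls away from the trend of the other points would have the opposite effect and decrease the correlation.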

• In this activity we will:

1. find a regression line and plot it on the scatterplot.

2. examine the effect of outliers on the regression line.

3. use the regression line to make predictions and evaluate how reliable these predictions are.

• Background : The modern Olympic Games have changed dramatically since their inception in 1896. For example, many commentators have remarked on the change in the quality of athletic performances from year to year. Regression will allow us to investigate the change in winning times for one event—the 1,500 meter race.

• R Instructions

2. The data have been loaded into the data frame olym . Enter the command

1. 
olym

3. to see the data. The data frame olym has the following variables: Year , Time .

4. Here is a description of the variables:

5. Year : the year of the Olympic Games, from 1896 to 2012.

6. Time : the winning time for the 1,500 meter race, in seconds.

7. First, let’s explore the relationship between the two quantitative variables: year and time . Produce a scatterplot and use it to verify that year and time are nearly linear in their relationship.

8. To do this in R, copy the command:

1. 
plot(olym$Year, olym$Time, xlab="Year of Olympic Games",ylab="Winning Time of 1500m Race (secs)")

9. Observe that the form of the relationship between the 1,500 meter race's winning time and the year is linear. The least squares regression line is therefore an appropriate way to summarize the relationship and examine the change in winning times over the course of the last century. We will now find the least squares regression line and plot it on a scatterplot.

• R Instructions

• In order to fit the regression line, we use the command lm().

• The lm() command produces a large amount of information, which we will want to extract as we need it, so we save the information to a variable named model.

1. 
model = lm(olym$Time~olym$Year)

• To add the regression line to the scatterplot, we can extract the linear equation from model and add the line to the scatterplot. To do this in R, copy the entire command below:

1. 

plot(olym$Year, olym$Time, xlab="Year of Olympic Games", ylab="Winning Time of 1500m Race (secs)");abline(model)

• We can also extract the y-intercept and slope from the model to determine the regression equation. To do this in R, copy the command:

1. 
coef(model)
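As an illustration of how the output of coef() maps onto a regression equation, here is a small sketch on made-up data (the data frame toy_olym and its values are hypothetical, not the course data). It also uses the data = argument of lm(), an equivalent way of writing the model formula without repeating the data frame name.

```r
# Hypothetical toy data (NOT the course data frame olym)
toy_olym <- data.frame(
  Year = c(1900, 1904, 1908, 1912, 1916, 1920),
  Time = c(246, 245, 243, 241, 240, 239)
)

# Same model as lm(toy_olym$Time ~ toy_olym$Year), written with data =
toy_model <- lm(Time ~ Year, data = toy_olym)

# Named vector: "(Intercept)" is the y-intercept, "Year" is the slope
cf <- coef(toy_model)
cf

# Build the regression equation as text from the two coefficients
paste("Time =", round(cf["(Intercept)"], 1), "+", round(cf["Year"], 3), "* Year")
```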


Q. Give the equation for the least squares regression line, and interpret it in context.

• The equation for the least squares regression line is: Time = 916 - 0.35 * Year. The slope of the line indicates that the winning time for the 1500 meter race decreases by about 0.35 seconds per year, or by about 4 * 0.35 = 1.40 seconds, on average, from one Olympiad to the next. (That the times are decreasing rather than increasing is indicated by the negative value of the slope.)

• R Instructions

• Notice that there is an outlier. Remove the outlier by copying the commands below (for this exercise we will not modify the x-axis and y-axis labels). Copy each line separately and in order, pressing Enter after each one to see the change:

1. 
plot(olym$Year[olym$Year!=1896], olym$Time[olym$Year!=1896])


2. 
L = lm(olym$Time[olym$Year!=1896]~olym$Year[olym$Year!=1896]);


3. 
abline(L);


4. 
cf=coefficients(L);


5. 
legend(1950,240,legend=paste("time = ",round(cf[1],0),round(cf[2],2),"year"))

• You will now see that the least squares regression line and the values in the equation have changed.
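The same comparison can be written more compactly by removing the row once with subset() and fitting both models with the data = argument. The sketch below uses made-up numbers (the data frame toy_olym is hypothetical, not the course data), with an unusually slow time in 1896 playing the role of the outlier.

```r
# Hypothetical toy data (NOT the course data frame olym); 1896 is unusually slow
toy_olym <- data.frame(
  Year = c(1896, 1900, 1904, 1908, 1912, 1916),
  Time = c(273, 246, 245, 243, 241, 240)
)

# Keep every row except the 1896 Games
toy_no_1896 <- subset(toy_olym, Year != 1896)

# Fit the line with and without the outlier
fit_all     <- lm(Time ~ Year, data = toy_olym)
fit_no_1896 <- lm(Time ~ Year, data = toy_no_1896)

# Compare intercept and slope side by side
rbind(all_years = coef(fit_all), without_1896 = coef(fit_no_1896))
```

In this toy example, as in the exercise, the slope is less negative once the outlier is removed.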

Q. Give the equation for this new line and compare it with the line you found for the whole dataset, commenting on the effect of the outlier.

• Once the outlier for the year 1896 is removed, the equation for the least squares regression line is: Time = 812 + (-0.30 * Year). Note: Some statistics packages may show the regression line as Time = 811 + (-0.30 * Year). When the outlier is removed, the line "drops" a bit: the intercept is smaller and the slope is not as negative. Both of these results are quite reasonable, since the original line was pulled upward toward the outlier.

Q. Our least squares regression line associates years as an explanatory variable, with times in the 1,500 meter race as the response variable. Use the least squares regression line you found in question 2 to predict the 1,500 meter time in the 2016 Olympic Games in Rio de Janeiro. Comment on your prediction.

• 812 + (-0.30 * 2016) = 207.20 seconds. This is an extrapolation. We cannot be sure that the linear dependence of winning times upon years holds beyond the range of the explanatory variable, which ends at the year 2012. At some point, the linear dependence must no longer apply, because it would predict impossible winning times.
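The prediction and the warning about extrapolation can both be checked with a few lines of R. This sketch uses the rounded equation from the answer above; the function name predict_time is made up for illustration.

```r
# Regression equation from the answer above (outlier removed): Time = 812 - 0.30 * Year
predict_time <- function(year) 812 - 0.30 * year

# Predicted winning time for 2016, in seconds
predict_time(2016)

# Extrapolating far enough eventually gives times that cannot occur
predict_time(3000)   # a negative "time"

# Year at which the line would reach 0 seconds
812 / 0.30
```

Because the coefficients are rounded, predictions from the full-precision model in R would differ slightly from these values.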