Purpose of Statistics Package Exercises
The Probability & Statistics course focuses on the processes you use to convert data into useful information. This involves several steps, including summarizing data.
In addition to applying these processes, you can learn to use statistical software packages to help manage, summarize, and interpret data. The statistics package exercises included throughout the course give you the opportunity to explore a dataset and answer questions based on the output using R, StatCrunch, a TI calculator, Minitab, or Excel. In each exercise, you can view instructions for whichever of these packages you choose to use.
The statistics package exercises are an extension of activities already embedded in the course and require you to use a statistics package to generate output and answer a different set of questions.
To Download R
To download R, a free software environment for statistical computing and graphics, go to https://www.r-project.org/ and follow the instructions provided.
Throughout the statistics package exercises, you will be given commands to execute in R. You can use the following steps to avoid having to type all of these commands in by hand:
Highlight the command with your mouse.
On the browser menu, click "Edit," then "Copy."
Click on the R command window, then at the top of the R window, click "Edit," then "Paste."
You may have to press Enter to execute the command.
The R instructions are current through version 3.2.5, released on April 14, 2016. Instructions in these statistics package exercises may not work with newer releases of R.
For help with installing R for Mac OS X or Windows, click here.
The purpose of this activity is to explore the effectiveness of randomization in creating similar treatment groups, in the sense that it balances the groups with respect to other variables that we didn't control for.
Background
A local internet service provider (ISP) created two new versions of its software, with alternative ways of implementing a new feature. To find the product that would lead to the highest satisfaction among customers, the ISP conducted an experiment comparing users' preferences for the two new versions versus the existing software.
The ISP ideally wants to find out which of the three software products causes the highest user satisfaction. It has identified three major potential lurking variables that might affect user satisfaction—gender, age, and hours per week of computer use.
In this activity, we will use adults in a hypothetical city as the population of interest to the ISP. We will:
create a simple random sample as the basis for the experimental study of the population,
use randomization to assign individuals to treatment groups, and
verify that randomization prevented the three treatment groups from being different with respect to the most obvious lurking variables.
To open R with the dataset preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R.
The data have been loaded into the data frame computers. The data frame has three variables: age, gender, and comp.
This dataset has more than 20,000 entries, so we will not display it. To verify that the data have been loaded into the variable computers, copy the following command to R:
You can see information about the age of the population and the number of hours per week that the subjects use computers (comp). You also see that there are more than 10,000 men and more than 10,000 women in the population (gender).
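The output described above (numeric summaries for age and comp, counts for gender) is what R's summary() function prints for a data frame. The following is a minimal sketch under that assumption. Because the course file is not available here, it builds a small simulated stand-in for computers; the simulated values are hypothetical and only the column names come from the exercise.

```r
# Simulated stand-in for the preloaded data frame (hypothetical values;
# in the exercise, `computers` is already loaded and this step is not needed)
set.seed(1)
n <- 20000
computers <- data.frame(
  age    = sample(18:80, n, replace = TRUE),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  comp   = round(runif(n, min = 0, max = 40), 1)
)

# Numeric columns get min, quartiles, mean, and max;
# a factor column such as gender gets a count per level
summary(computers)

# Quick structural checks: number of rows and the column names
nrow(computers)
names(computers)
```

If gender were stored as character strings rather than as a factor, summary() would not show counts for it; table(computers$gender) gives the counts in either case.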
Our dataset contains the values of the three possible lurking variables:
age: in years
gender: female or male
comp: hours per week of computer use
The company must rely upon sampling to study its customers' preferences, since the entire population cannot be assigned to treatments. Therefore, we will first choose a simple random sample (SRS) of 450 people for the subjects in the study.
To choose the sample, copy the following command to R:
random_sample = computers[sample(length(computers$age), 450),]
Again, we do not wish to view all 450 entries in the random sample. Instead, let's look at a summary of the sample by copying the following command:
By looking at the numbers of males and females in the sample, we see that the sample indeed has 450 entries.
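A summary of the sample can be produced the same way as for the full dataset. The sketch below assumes summary() is the intended command and again uses a simulated stand-in for computers (hypothetical values). It also confirms the sample size directly instead of adding the gender counts by hand.

```r
# Simulated stand-in for the preloaded data frame (hypothetical values)
set.seed(2)
n <- 20000
computers <- data.frame(
  age    = sample(18:80, n, replace = TRUE),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  comp   = round(runif(n, min = 0, max = 40), 1)
)

# Draw 450 distinct row numbers (sample() draws without replacement by default)
# and keep only those rows
random_sample <- computers[sample(nrow(computers), 450), ]

# Summary of the sample: numeric summaries plus gender counts
summary(random_sample)

# The gender counts add up to the sample size
gender_counts <- table(random_sample$gender)
gender_counts
sum(gender_counts)
nrow(random_sample)
```

Here nrow(computers) plays the same role as length(computers$age) in the exercise command; both give the number of rows.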
Now we will randomly assign our SRS of 450 subjects to treatment groups, one for each of the three versions of the ISP's software. Let's denote the versions "1," "2," and "3," and create a categorical variable to identify the treatment for each subject.
To use R to randomly assign the 450 subjects to one of three treatments, copy the following commands (note these are two separate command lines):
group = sample(1:3, 450, replace=TRUE)
random_sample = cbind(random_sample,group)
Note on using R: The variable group is a vector of 450 values chosen at random from 1, 2, and 3. You can see them by entering the variable name group into R. The second command adds group as a fourth column to the data frame random_sample.
The four columns are age, gender, comp, and group.
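Because each subject's treatment is drawn independently, the three groups will usually not contain exactly 150 subjects each. The sketch below shows how to check the group sizes, and, as an alternative that is not part of the exercise, one way to force equal group sizes by shuffling a fixed set of labels. The name balanced_group is hypothetical.

```r
set.seed(3)

# Independent assignment, as in the exercise: each subject gets 1, 2, or 3
# with equal probability, so group sizes vary from run to run
group <- sample(1:3, 450, replace = TRUE)
table(group)

# Alternative (an assumption, not the exercise's method): create exactly
# 150 labels of each treatment and shuffle their order
balanced_group <- sample(rep(1:3, each = 150))
table(balanced_group)
```

Both approaches are random assignments; the second simply guarantees equal group sizes.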
We are finally reaching the goal of this activity. We will now examine whether the randomization was successful in making our three treatment groups similar with respect to the variables age, gender, and comp. In other words, we will now examine whether the distributions of these variables in the three groups are similar or not.
To compare the distribution of age among the three treatment groups, we'll create side-by-side boxplots of age by treatment.
Copy the following command to R:
boxplot(random_sample$age~random_sample$group, xlab="Group", ylab="Age (years)")
To compare the distribution of gender among the three treatment groups, we'll look at a two-way table of conditional percents:
Copy the following commands to R (note these are two separate command lines):
two_way_table = table(random_sample$group,random_sample$gender)
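One way to turn the table of counts into conditional percents is R's prop.table() function. The sketch below assumes that row percents are wanted, that is, the percent of females and males within each treatment group. It uses a simulated stand-in for random_sample (hypothetical values), and the name conditional_percents is also hypothetical.

```r
# Simulated stand-in for the sample with its treatment column (hypothetical values)
set.seed(4)
random_sample <- data.frame(
  gender = factor(sample(c("female", "male"), 450, replace = TRUE)),
  group  = sample(1:3, 450, replace = TRUE)
)

# Counts: rows are treatment groups, columns are genders
two_way_table <- table(random_sample$group, random_sample$gender)
two_way_table

# margin = 1 divides each count by its row total, so every row sums to 100%
conditional_percents <- 100 * prop.table(two_way_table, margin = 1)
round(conditional_percents, 1)
```

If randomization balanced gender, the three rows of percents should look roughly alike.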
To compare the distribution of comp (the hours per week of computer use) among the three treatment groups, we'll create side-by-side boxplots of comp by treatment. Follow the instructions above, making the obvious necessary changes.
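As a sketch of what those changes might look like, the age boxplot command can be adapted by swapping in comp and its axis label. The example uses a simulated stand-in for random_sample (hypothetical values) and adds numeric summaries per group as a complement to the plot; the axis label text and the name comp_by_group are assumptions.

```r
# Simulated stand-in for the sample with its treatment column (hypothetical values)
set.seed(5)
random_sample <- data.frame(
  comp  = round(runif(450, min = 0, max = 40), 1),
  group = sample(1:3, 450, replace = TRUE)
)

# Side-by-side boxplots of weekly computer use, one box per treatment group
boxplot(random_sample$comp ~ random_sample$group,
        xlab = "Group", ylab = "Computer use (hours per week)")

# Numeric counterpart: min, quartiles, mean, and max of comp within each group
comp_by_group <- tapply(random_sample$comp, random_sample$group, summary)
comp_by_group
```

Similar medians and quartiles across the three groups indicate that the distributions of comp are about the same.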
Q. Comment on the displays you created. In particular, are the distributions of age, gender, and comp in the three treatment groups similar?
Everyone will get slightly different displays here, but they should all "look" about the same.
Based upon the side-by-side boxplots, the distributions of age and of hours per week of computer use appear about the same in each of the three treatment groups.
Similarly, the table of conditional percents suggests that the distribution of the genders is about the same in all three treatment groups.
Q. Based on your answer to the question above, does the randomization allow us to study the differences in user preferences among the three software versions, while eliminating the possible effects of the lurking variables age, gender, and hours per week of computer use? Comment below:
Our results suggest that the distributions of age, gender, and hours per week of computer use among the three treatments are about the same; therefore, the randomization was successful in balancing these three potential lurking variables among the three treatment groups.
We can be fairly sure that any difference between the treatment groups that we find on the user tests of the software will be due to differences in the three software versions, rather than the lurking variables of age, gender, and hours per week of computer use.