• Purpose of Statistics Package Exercises: The Probability & Statistics course focuses on the processes you use to convert data into useful information. This involves

1. Collecting data,

2. Summarizing data, and

3. Interpreting data.

• In addition to being able to apply these processes, you can learn how to use statistical software packages to help manage, summarize, and interpret data. The statistics package exercises included throughout the course give you the opportunity to explore a dataset and answer questions based on the output using R, StatCrunch, a TI calculator, Minitab, or Excel. In each exercise, you can choose to view the instructions for whichever of these statistics packages you use.

• The statistics package exercises are an extension of activities already embedded in the course and require you to use a statistics package to generate output and answer a different set of questions.

1. To download R, a free software environment for statistical computing and graphics, go to https://www.r-project.org/ and follow the instructions provided.

• Using R

1. Throughout the statistics package exercises, you will be given commands to execute in R. You can use the following steps to avoid having to type all of these commands in by hand:

2. Highlight the command with your mouse.

3. On the browser menu, click "Edit," then "Copy."

4. Click on the R command window, then at the top of the R window, click "Edit," then "Paste."

5. You may have to press Enter to execute the command.

• R Version

1. The R instructions are current through version 3.2.5 released on April 14, 2016. Instructions in these statistics package exercises may not work with newer releases of R.

2. For help with installing R for Mac OS X or Windows, click here.
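To see which release of R is installed, something like the following can be run at the R prompt. This is only a quick check, not part of the original exercise; both functions are part of base R.

```r
# Print the full version string of the R session that is running
R.version.string

# TRUE if the installed R is newer than 3.2.5, the release these
# instructions were written for
getRversion() > "3.2.5"
```

If the second line returns TRUE, the commands in these exercises may still work, but the output may differ slightly from what is described.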

• The purpose of this activity is to discuss how in some cases exploratory data analysis can help you determine whether the conditions that allow us to use the z-test for the population mean (μ) are met.

• Background: In the Exploratory Data Analysis unit, we stressed that in general, it is always a good idea to look at your data (if the actual data are given). Moreover, related to our discussion now, looking at the data can be very helpful when trying to determine whether you can reliably use the test. In both of our leading examples, the data summaries (sample size, sample mean) were given rather than the raw data, but in practice, you are often working with the raw data. In example 1, we were told the SAT-M scores vary normally in the population, so even though the sample size (n = 4) was quite small, we could proceed with the test. In example 2, the sample size was large enough (n = 100) for us to proceed with the test even though we do not know whether the concentration level varies normally.

• Now imagine the following situation: A health educator at a small college wants to determine whether the exercise habits of male students in the college are similar to the exercise habits of male college students in general. The educator chooses a random sample of 20 male students and records the time they spend exercising in a typical week. Do the data provide evidence that the mean time male students in the college spend exercising in a typical week differs from the mean time for male college students in general (which is 8 hours)?

• Comment: Whether σ is known or not is really not relevant to this activity.

• Here is a situation in which we do not have any information about whether the variable of interest, "time" (time spent exercising in a typical week) varies normally or not, and the sample size (n = 20) is not really large enough for us to be certain that the Central Limit Theorem applies. Recall from our discussion on the Central Limit Theorem that unless the distribution of "time" is extremely skewed and/or has extreme outliers, a sample of size 20 should be fine. However, how can we be sure that is, indeed, the case?

• If only the data summaries are given, there is really not a lot that can be done. You can say something like: "I'll proceed with the test assuming that the distribution of the variable "time" is not extremely skewed and does not have extreme outliers." If the actual data are given, you can make a more informed decision by looking at the data using a histogram. Even though the histogram of a sample of size 20 will not paint the exact picture of how the variable is distributed in the population, it could give a rough idea.
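The role of skewness for samples of size 20 can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not part of the course data: the two populations (a slightly skewed gamma and a heavily skewed lognormal) and the helper names `sample_means` and `skewness` are made up for this example.

```r
set.seed(42)  # make the simulation reproducible

# Simulate the sampling distribution of the sample mean for samples of size n.
# `draw` is a function that returns n observations from a population.
sample_means <- function(draw, n = 20, reps = 5000) {
  replicate(reps, mean(draw(n)))
}

# Two hypothetical populations (assumptions for illustration only):
# slightly skewed (gamma) and extremely skewed with high outliers (lognormal).
mild    <- sample_means(function(n) rgamma(n, shape = 8, rate = 1))
extreme <- sample_means(function(n) rlnorm(n, meanlog = 1, sdlog = 1.5))

# Sample skewness: close to 0 for a symmetric, normal-looking distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Histograms of the simulated sample means, side by side
par(mfrow = c(1, 2))
hist(mild, main = "Mild skew, n = 20", xlab = "Sample mean")
hist(extreme, main = "Extreme skew, n = 20", xlab = "Sample mean")

round(c(mild = skewness(mild), extreme = skewness(extreme)), 2)
```

Under these assumptions, the sample means from the mildly skewed population should look close to normal, while those from the extremely skewed population should still show a long right tail at n = 20.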

• R Instructions

2. The data have been loaded into the data frame time. Enter the command

1. time

3. to see the data. There are 4 columns (named time1 through time4) representing 4 different samples of size 20 in the data frame.

4. To create a histogram of the column time1 with R, enter the command:

1. hist(time$time1, xlab="Time Exercising Per Week", main="Sample 1")

5. Now change time1 to time2, time3, and time4 in the code above to see the histogram for each time column.

6. Note: You can modify the x-label and title as you choose.

1. Explanation :
Time 1: The histogram displays a roughly normal shape. For a sample of size 20, the shape is definitely normal enough for us to assume that the variable varies normally in the population and therefore it is safe to proceed with the test.

Time 2: The histogram displays a distribution that is slightly skewed and does not have any outliers. The histogram, therefore, does not give us any reason to be concerned that for a sample of size 20 the Central Limit Theorem will not kick in. We can therefore proceed with the test.

Time 3: The distribution does not have any "special" shape, and has one small outlier which is not very extreme (although it is arguable whether you would classify it as an outlier). Again, the histogram does not give us any reason to be concerned that for a sample of size 20 the Central Limit Theorem will not kick in. We can therefore proceed with the test.

Time 4: The distribution is extremely skewed to the right, and has one pretty extreme high outlier. Based on this histogram, we should be cautious about proceeding with the test, because assuming that this histogram "paints" at least a rough picture of how the variable varies in the population, a sample of size 20 might not be large enough for the Central Limit Theorem to kick in and ensure that x̄ has a normal distribution.

• Comments:

• It is always a good idea to look at the data and get a sense of their pattern regardless of whether you actually need to do it in order to assess whether the conditions are met.

• This idea of looking at the data is not only relevant to the z-test, but to tests in general.
In particular, we'll see that in the case where σ is unknown (which we'll discuss next) the conditions that allow us to safely use the test are the same as the conditions in this case, so the ideas of this activity directly apply to that case as well.

• Also, as you'll see, in the next module (inference for relationships), doing exploratory data analysis before inference will be an integral part of the process.

• The purpose of this activity is to teach you to run the z-test for the population mean while exploring the effect of sample size on the significance of the results.

• Background: Recall example 1 that we've just completed. Even though the sample mean was 550, which is substantially greater than the null value, 500, this result was not significant, since it was based on data obtained from only 4 students. In other words, the data did not provide enough evidence to reject Ho and conclude that the mean SAT-M score of all Ross College students is larger than 500, the national mean. If this sample mean were obtained from 5 students, would that result be significant? If not, would 6 be enough? In other words, what is the smallest sample size for which a sample mean of $$\overline{x} = 550$$ would be significant? In this activity, we will use statistical software to explore this question.

• Comment: If you think about it, this question is not very practical, because you do not know in advance what the sample mean will be, but it is intuitive enough that it will help you get a better sense of how the sample size affects the significance of the results.

• R Instructions

1. For n = 4, we know that the result x̄ = 550 is not significant. Using the R instructions below and starting with n = 5 (and going up), create and fill in a table where the column headings are "n," "z (test statistic)," "p-value," and "significant at the 0.05 level (yes/no)." Stop after the first time your result becomes significant at the 0.05 significance level.

2. We can use the following code to test the significance of the result under different sample size scenarios, n = 5 to n = 15.

3. Create a list of numbers 5,6,7,...,15 to represent various sample sizes:

1. n=5:15

4. Calculate the z-scores for each sample size:

1. z=(550-500)/(100/sqrt(n))

5. Calculate the p-values:

1. pvalue=1-pnorm(z)

6. Test whether the p-values are significant at 0.05. TRUE=Significant, FALSE=Not Significant:

1. significance=c(pvalue<0.05)

7. Compile the information into a table:

1. data.frame(n,z,pvalue,significance)

1. Explanation :
For a sample mean of 550 to be a significant result, it needs to be obtained from a sample of size at least 11. With n = 10, z ≈ 1.58 and the p-value is about 0.057; with n = 11, z ≈ 1.66 and the p-value is about 0.049, the first value below 0.05.

• The purpose of this activity is to give you guided practice in going through the whole process of hypothesis testing for the population mean (assuming that σ is known).

• Background: The length of human pregnancy is known to have a mean of 266 days and a standard deviation of 16 days. Based on records from a large women's hospital, a random sample of 25 women who were smoking and/or drinking alcohol during their pregnancy was chosen, and their pregnancy lengths are recorded in the datafile linked below. Do the data provide enough evidence to support the (well-known) fact that women who smoke and/or drink alcohol during their pregnancy have shorter pregnancies than women in general (in other words, are more likely to have premature labor)?

• Comment: It is reasonable to assume that the known standard deviation of 16 days applies also to women who smoke and/or drink during their pregnancy.

• R Instructions

1. To open R with the data set preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R.

2. The data have been loaded into the data frame pregnancy. Enter the command

1. pregnancy

3. to see the data. The variable name in the data frame is length.

• R Instructions

1. The R base package does not have a z-test command built in, so we must either create our own z-test function or use an add-on package. We will create our own function z.test(). The parameters for z.test() are:

1. x : a sample data set

2. sig : the population standard deviation

3. mu0 : the mean hypothesized by the null hypothesis

4. alt : either "less" or "greater" or "two.sided", indicating the form of the alternative hypothesis μ < μ_0, μ > μ_0, or μ ≠ μ_0 respectively. The default is "greater", meaning that if you do not specify alt, the function automatically conducts the "greater than" hypothesis test.

2. Copy this function to R as a single block of code:

1. z.test = function(x, sig, mu0, alt="greater") {
     mu = mean(x)                  # sample mean
     n = length(x)                 # sample size
     z = (mu-mu0)/(sig/sqrt(n))    # z test statistic
     if (alt=="less") {
       p = pnorm(z)                # left-tail p-value
     } else if (alt=="two.sided") {
       p = 2*(1-pnorm(abs(z)))     # two-tail p-value
     } else {
       p = 1-pnorm(z)              # right-tail p-value (default)
     }
     paste("mean = ",mu,", n = ",n,", z = ",z,", p-value = ",round(p,5))
   }

3. The function z.test() returns the sample mean, the sample size, the z-score for the sample and the p-value for the hypothesis test. To use the function z.test() on the data already entered, execute the command:

1. z.test(x=pregnancy$length, sig=16, mu0=266, alt="less")

4. Note: We use this same command for other "Learn By Doing" activities, so be aware that if you restart R before the next module, you must reenter the block of code that defines the z.test() function into R to use it again. Otherwise, you will get the following error:

5. Error: could not find function "z.test"

1. Explanation :
Based on the output, the sample mean pregnancy length of the 25 women is 259.68. In addition, we know that n = 25, μ_0 = 266, and σ = 16. The test statistic is therefore $$z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{259.68 - 266}{16/\sqrt{25}} = -1.975$$ as given in the output. This means that the sample mean is almost 2 standard deviations (of the sample mean) below the null value.

1. Explanation :
The p-value is 0.024, which (using the 0.05 significance level) is small enough to indicate that the results are significant. In other words, the data provide enough evidence to reject Ho and conclude that the mean pregnancy length of women who smoke and/or drink alcohol during pregnancy is smaller than the mean pregnancy length of women in general.
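The numbers in the two explanations above can be reproduced from the summary values alone, without the data file. The following is a minimal sketch; the variable names are chosen only for this illustration.

```r
# Summary values quoted in the explanations above
xbar  <- 259.68  # sample mean pregnancy length (days)
mu0   <- 266     # null value
sigma <- 16      # known population standard deviation
n     <- 25      # sample size

z <- (xbar - mu0) / (sigma / sqrt(n))  # z test statistic
p <- pnorm(z)                          # left-tail p-value for the "less" alternative

round(c(z = z, p.value = p), 3)
```

The rounded results should agree with the test statistic and the p-value of 0.024 reported above.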

• The purpose of this activity is to give you guided practice in the process of a t-test for the population mean and to teach you how to perform this test using statistical software.

• Background: A group of 75 college students from a certain liberal arts college were randomly sampled and asked about the number of alcoholic drinks they have in a typical week. The file containing the data is linked below. The purpose of this study was to compare the drinking habits of the students at the college to the drinking habits of college students in general. In particular, the dean of students, who initiated this study, would like to check whether the mean number of alcoholic drinks that students at his college have in a typical week differs from the mean of U.S. college students in general, which is estimated to be 4.73.

• R Instructions

2. The data have been loaded into the data frame drinks. Enter the command

1. drinks

3. to see the data. The variable in the data frame is drinks.per.week.

4. To use R to perform a t-test for the population mean using our data and the designated alternative hypothesis, enter the command:

1. t.test(drinks$drinks.per.week, mu = 4.73, alternative = "two.sided")

5. R returns all the information you need to complete your t-test, including:

1. The sample mean (mean of x)

2. The t-test statistic of the sample (t)

3. The degrees of freedom (df) (one less than the sample size)

4. The alternative hypothesis

5. The p-value of the test (p-value)

6. A confidence interval (default is 95%; see note)

6. Note: Using R, the possible values for the alternative hypothesis are "less" , "greater" , and "two.sided" , corresponding to the three types of alternative hypotheses in a t-test.

7. If "two.sided" is the alternative, then the usual two-sided confidence interval is returned.

8. If "less" or "greater" is the alternative, then the returned confidence interval is what is known as a one-sided confidence interval, which we have not discussed.
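The pieces of output listed above can also be pulled out of the result of t.test() individually. The sketch below uses simulated counts as a hypothetical stand-in for drinks$drinks.per.week (the course data are not reproduced here), so the numbers it prints will not match the exercise.

```r
set.seed(7)
# Hypothetical stand-in for drinks$drinks.per.week (simulated, NOT the course data)
x <- rpois(75, lambda = 4)

res <- t.test(x, mu = 4.73, alternative = "two.sided")

res$estimate    # sample mean (mean of x)
res$statistic   # t test statistic
res$parameter   # degrees of freedom (one less than the sample size)
res$p.value     # p-value of the test
res$conf.int    # confidence interval (95% by default)

# The statistic can be reproduced by hand from the sample mean and
# the sample standard deviation
t_by_hand <- (mean(x) - 4.73) / (sd(x) / sqrt(length(x)))
all.equal(unname(res$statistic), t_by_hand)
```

Storing the result in a variable such as `res` is a convenience for this illustration; calling t.test() directly, as in the instructions, prints the same information.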

1. Explanation :
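The test statistic, t = -1.83, is calculated as follows: $$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}$$ where $$\overline{x}$$ is the sample mean and s is the sample standard deviation, μ_0 = 4.73 is the null value, and n = 75. The sample mean is 1.83 standard errors below the null value.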

1. Explanation :
The p-value is 0.072, which at the 0.05 significance level indicates that the results are not significant. The data, therefore, do not provide enough evidence to reject Ho and conclude that the mean number of alcoholic drinks that students at the college consume in a typical week is different from 4.73, the mean of college students in general.
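As a rough check, the two-sided p-value can be recomputed from the reported test statistic and the degrees of freedom. This sketch starts from the rounded value t = -1.83, so the result may differ slightly in the third decimal place from the p-value R reports for the full data.

```r
t_stat <- -1.83   # test statistic reported in the explanation above (rounded)
df     <- 75 - 1  # degrees of freedom: one less than the sample size

# Two-sided p-value: probability in both tails beyond |t|
p_two_sided <- 2 * pt(-abs(t_stat), df)
p_two_sided

p_two_sided < 0.05  # FALSE means not significant at the 0.05 level
```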