• Purpose of Statistics Package Exercises: The Probability & Statistics course focuses on the processes you use to convert data into useful information. These processes are:


    1. Collecting data,


    2. Summarizing data, and


    3. Interpreting data.


  • In addition to applying these processes, you can learn how to use statistical software packages to help manage, summarize, and interpret data. The statistics package exercises included throughout the course give you the opportunity to explore a dataset and answer questions based on the output. In each exercise, you can choose to view instructions for R, Statcrunch, TI Calculator, Minitab, or Excel, depending on which statistics package you use.


  • The statistics package exercises are an extension of activities already embedded in the course and require you to use a statistics package to generate output and answer a different set of questions.


  • To Download R


    1. To download R, a free software environment for statistical computing and graphics, go to https://www.r-project.org/ and follow the instructions provided.


  • Using R


    1. Throughout the statistics package exercises, you will be given commands to execute in R. You can use the following steps to avoid having to type all of these commands in by hand:


    2. Highlight the command with your mouse.


    3. On the browser menu, click "Edit," then "Copy."


    4. Click on the R command window, then at the top of the R window, click "Edit," then "Paste."


    5. You may have to press Enter to execute the command.


  • R Version


    1. The R instructions are current through version 3.2.5 released on April 14, 2016. Instructions in these statistics package exercises may not work with newer releases of R.


    2. For help with installing R for MAC OS X or Windows click here






  • The purpose of this activity is to learn how to use statistical software to calculate confidence intervals for μ when σ is known. Software is particularly useful when all you have are the raw data (no summary has been calculated), which is what you encounter in practice. In all the examples and activities we have looked at so far, the sample mean was given (rather than the whole data set), in which case it often takes less time to calculate the confidence interval by hand than to launch a software program and have it do the calculation.


  • Background: Some studies suggest that women having their first baby at age 35 or older are at increased risk of having a baby with a low birth weight. A medical researcher wanted to estimate μ, the mean weight of newborns who are the first child of women ages 35 or older. To this end, the researcher chose a random sample of 125 women ages 35 and older who were pregnant with their first child and followed them through the pregnancy. The datafile linked below contains the birth weight (in grams) of the 125 newborns (women pregnant with more than one child were excluded from the study). From past research, it is assumed that the weight of newborns has a standard deviation of σ = 500 grams. We will estimate μ with a 99% confidence interval.


  • R Instructions


    1. To open R with the data set preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R.


    2. The data have been loaded into the data frame birthweight .


    3. Enter the following command to see the data:


      1. 
        birthweight

    4. The variable name in the data frame is also birthweight.


    5. In R, there is no specific command to calculate a z-based confidence interval. We will use other functions in R to calculate the components of the confidence interval.


    6. First, we will reassign the variable of interest to the variable name x .


      1. 
        x=birthweight$birthweight

    7. Now we will calculate the components we need: the z critical value, z , the sample mean, x̄ , and the sample size, n .


    8. Set your Confidence Level, C , and z critical value, z :


      1. 
        C=0.99
        z=qnorm((1+C)/2)

    9. Set your Population Standard Deviation, σ :


      1. 
        σ=500

    10. Calculate the sample mean, xbar , and sample size, n :


      1. 
        n=length(x)
        xbar=mean(x)

    11. Now we can construct the confidence interval. Lower Bound:


      1. 
            xbar-z*(σ/sqrt(n))

    12. Upper Bound:


      1. 
            xbar+z*(σ/sqrt(n))

    13. To calculate a confidence interval with a different confidence level, change the value of C and rerun each step of the code. To calculate a confidence interval for a new variable, reassign x and σ, set C to the appropriate confidence level, and rerun each step of the code.
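The steps above can be collected into a single reusable function. The following is only a sketch: the function name `z_interval` and the five sample weights are hypothetical and are not part of the course files. With the course data loaded, you would pass `birthweight$birthweight` as `x` instead.

```r
# Hypothetical helper (not part of the course materials) that wraps
# the z-interval steps: critical value, sample statistics, bounds.
z_interval <- function(x, sigma, conf_level = 0.99) {
  n <- length(x)                     # sample size
  xbar <- mean(x)                    # sample mean
  z <- qnorm((1 + conf_level) / 2)   # z critical value
  margin <- z * sigma / sqrt(n)      # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Made-up birth weights in grams, for illustration only
weights <- c(3100, 2950, 3300, 3020, 3180)
ci <- z_interval(weights, sigma = 500, conf_level = 0.99)
print(ci)
```

Changing the confidence level or the variable then only requires changing the arguments in one call rather than rerunning each step.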





  1. Explanation :
    Thus, we are 99% confident that the mean birth weight of first babies born to mothers who are 35 or older is between 2,996 and 3,227 grams.
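As a quick check that does not need the data file, the margin of error depends only on σ, n, and the confidence level. The sketch below uses the values from the background (σ = 500, n = 125, 99% confidence) and compares the result with half the width of the reported interval. The variable names are illustrative.

```r
sigma <- 500                       # assumed population standard deviation
n <- 125                           # sample size
C <- 0.99                          # confidence level
z <- qnorm((1 + C) / 2)            # z critical value, about 2.576
margin <- z * sigma / sqrt(n)      # margin of error in grams
half_width <- (3227 - 2996) / 2    # half the width of the reported interval
print(round(margin, 1))            # 115.2
print(half_width)                  # 115.5
```

The small difference between the two numbers comes from rounding the interval endpoints to whole grams.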





  1. Explanation :
    Since our 99% confidence interval (which provides a set of plausible values for μ) lies entirely below 3,450, we can conclude (with 99% confidence) that the mean birth weight of babies who are the first child born to mothers who are 35 or older is lower than the mean birth weight in the general population.






  • The purpose of this activity is to learn how to use statistical software to calculate confidence intervals for μ when σ is unknown. In this case, software is useful both when all you have is the raw data and when only summary statistics (the sample mean and the sample standard deviation s) are provided. The reason for using software in the latter case is, as mentioned before, the complexity of determining the appropriate t*.


  • Background: As part of a large survey conducted at a large state university, a random sample of 142 students was asked: "How many hours do you sleep in a typical day?" The datafile linked below contains the data. Use these data to estimate μ, the mean number of hours college students at this university sleep in a typical day, with a 95% confidence interval.


  • R Instructions


    1. To open R with the data set preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R.


    2. The data have been loaded into the data frame sleep. Enter the following command to see the data:


      1. 
        sleep

    3. The variable in the data frame is also called sleep.


    4. Note that in this case we need to find a confidence interval for the population mean when the population variance is unknown, and we will therefore use the interval that we have just introduced (t-interval). R will do this for us. Enter the command:


      1. 
        t.test(sleep$sleep, conf.level=0.95)$conf.int

    5. To change the confidence level, simply change the 0.95 in the code to the appropriate level (in decimal form) and rerun the code.
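To see what `t.test()` computes, the same interval can be built by hand as x̄ ± t* · s/√n, where t* comes from the t distribution with n − 1 degrees of freedom. The sketch below uses a short, made-up vector of sleep hours (not the course data) and compares the manual result with the built-in one. The same formula applies when only the summary statistics x̄, s, and n are given.

```r
# Made-up hours of sleep, for illustration only
hours <- c(7, 6.5, 8, 7.5, 6, 9, 7, 8.5)
C <- 0.95                                  # confidence level
n <- length(hours)                         # sample size
xbar <- mean(hours)                        # sample mean
s <- sd(hours)                             # sample standard deviation
t_star <- qt((1 + C) / 2, df = n - 1)      # t critical value
manual <- c(xbar - t_star * s / sqrt(n),
            xbar + t_star * s / sqrt(n))
builtin <- t.test(hours, conf.level = C)$conf.int
print(manual)
print(builtin)
```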





  1. Explanation :
    We are 95% certain that the mean number of hours college students at this state university sleep in a typical day is between 7.09 and 7.62.






  • The purpose of this activity is to learn how to use software to calculate a confidence interval for the population proportion p (for a given level of confidence). Software is particularly useful when raw data are given rather than data summaries, which is the situation you would usually encounter in practice.


  • Background: The U.S. federal ban on assault weapons expired in September 2004, which meant that after 10 years (since the ban was instituted in 1994) certain types of guns could be manufactured legally again. A poll asked a random sample of 1,200 eligible voters (among other questions) whether they were satisfied with the fact that the law had expired. The datafile linked below contains the results of this poll (data were generated based on an NBC News/Wall Street Journal poll). We would like to estimate p, the proportion of U.S. eligible voters who were satisfied with the expiration of the law, with a 95% confidence interval.


  • Before analyzing the data, answer the following question:





  1. Explanation :
    The margin of error of this poll is 1 / sqrt(1,200) = .0289, which equals approximately 2.9%.
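The 1/√n rule used here can be understood as a worst-case bound: at the 95% level, the margin of error z*·√(p̂(1 − p̂)/n) is largest when p̂ = 0.5, and rounding z* from about 1.96 up to 2 gives 2·√(0.25/n) = 1/√n. The short sketch below checks this numerically for n = 1,200. The variable names are illustrative.

```r
n <- 1200
conservative_margin <- 1 / sqrt(n)                  # rule of thumb
worst_case <- qnorm(0.975) * sqrt(0.5 * 0.5 / n)    # z* = 1.96, p-hat = 0.5
print(round(conservative_margin, 4))                # 0.0289
print(worst_case <= conservative_margin)            # TRUE
```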






  • R Instructions


    1. To open R with the data set preloaded, right-click here and choose "Save Target As" to download the file to your computer. Then find the downloaded file and double-click it to open it in R.


    2. The data have been loaded into the data frame support. Enter the following command to see the data:


      1. 
        support

    3. The variable name in the data frame is opinion.


    4. This file contains 1,200 responses in the form "satisfied" and "not satisfied," which is too much to display. To check that all the data are loaded, count the data with the command:


      1. 
        n=length(support$opinion);n

    5. R should return 1,200 as the number of observations. This is the sample size.


    6. We can get the summary counts using the table() command.


      1. 
        t=table(support$opinion);t

    7. Notice that the first entry in the table is the "not satisfied" opinion. If we want just that count, we can extract it from the table:


      1. 
        t[1]

    8. We can also extract the second entry, "satisfied," from the table:


      1. 
        t[2]

    9. To calculate confidence intervals for a single proportion, we can use the following command:


      1. 
        prop.test()

    10. Here is the 95% confidence interval for the proportion of the population that is "not satisfied" with the expired law:


      1. 
        prop.test(t[1],n,conf.level=0.95)

    11. Or we can calculate the 95% confidence interval for the proportion of the population that is "satisfied" with the expired law:


      1. 
        prop.test(t[2],n,conf.level=0.95)$conf.int

    12. Note: In R, be aware of the structure of your data at all times. In the created table t , the first [1] entry represented the "not satisfied" count, which is why t[1] extracted that value.
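One way to avoid depending on the position of an entry is to index the table by category name. The sketch below uses a small, made-up response vector (not the poll data); `table()` orders character categories alphabetically, which is why "not satisfied" comes first. The variable names are hypothetical.

```r
# Hypothetical responses, for illustration only (not the poll data)
opinion <- rep(c("satisfied", "not satisfied"), times = c(30, 70))
counts <- table(opinion)
print(names(counts))                   # categories in the order R stores them
satisfied <- counts[["satisfied"]]     # select the count by name, not position
n <- length(opinion)
ci <- prop.test(satisfied, n, conf.level = 0.95)$conf.int
print(ci)
```

Indexing by name keeps the code correct even if the categories appear in a different order in another data set.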





  1. Explanation :
    142 of the 1,200 sampled voters answered that they were satisfied.





  1. Explanation :
    The sample proportion is therefore 142/1,200 ≈ 0.118, or roughly 12%.





  1. Explanation :
    The 95% confidence interval for p is (0.10, 0.14) (rounded). We are 95% certain that the proportion of U.S. voters who were satisfied with the expiration of the federal ban on assault weapons is between 0.10 and 0.14 (or between 10% and 14%).





  1. Explanation :
    The 95% confidence interval for p, (0.10, 0.14), has a width of 0.04, and therefore a margin of error of 0.02 (or 2%). In the first question, though, we found that the margin of error of this poll is roughly 2.9%. This is because the margin of error in the first question was found using the "conservative" approach, and is the margin of error of the whole poll. What it says is: based on this sample size, the margin of error for any of the questions in this poll will be no more than 2.9%, regardless of what the sample proportions are. For the particular question in this example, the margin of error happened to be lower.
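The comparison can be reproduced from the summary counts alone. The sketch below assumes the standard large-sample formula z*·√(p̂(1 − p̂)/n) for the question-specific margin of error. Note that `prop.test()` uses a slightly different method, so its interval will not match this formula exactly.

```r
n <- 1200                              # sample size
x <- 142                               # number who answered "satisfied"
p_hat <- x / n                         # sample proportion
z <- qnorm(0.975)                      # z critical value for 95% confidence
specific_margin <- z * sqrt(p_hat * (1 - p_hat) / n)
conservative_margin <- 1 / sqrt(n)
print(round(specific_margin, 3))       # 0.018
print(round(conservative_margin, 3))   # 0.029
```

The question-specific margin is smaller because p̂(1 − p̂) is well below its maximum of 0.25 when p̂ is about 0.12.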