After sampling i.e. first stage in the data production process is completed, we come to the second stage, that of gaining information about the variables of interest from the sampled individuals. In this module we'll discuss three study designs; each design enables you to determine the values of the variables in a different way.
You can: Carry out an observational study, in which values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study.
Take a sample survey, which is a particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions.
Perform an experiment. Instead of assessing the values of the variables as they naturally occur, the researchers interfere, and they are the ones who assign the values of the explanatory variable to the individuals. The researchers "take control" of the values of the explanatory variable because they want to see how changes in the value of the explanatory variable affect the response variable. (Note: By nature, any experiment involves at least two variables.)
The type of design used, and the details of the design, are crucial, since they will determine what kind of conclusions we may draw from the results. In particular, when studying relationships in the Exploratory Data Analysis unit, we stressed that an association between two variables does not guarantee that a causal relationship exists. We will explore how the details of a study design play a crucial role in determining our ability to establish evidence of causation.
Because each type of study design has its own advantages and trouble spots, it is important to begin by determining what type of study we are dealing with. The following example helps to illustrate how we can distinguish among the three basic types of design mentioned in the introduction — observational studies, sample surveys, and experiments.
Example : Suppose researchers want to determine whether people tend to snack more while they watch television. In other words, the researchers would like to explore the relationship between the explanatory variable "TV" (a categorical variable that takes the values "on'" and "not on") and the response variable "snack consumption."
Identify each of the following designs as being an observational study, a sample survey, or an experiment.
Recruit participants for a study. While they are presumably waiting to be interviewed, half of the individuals sit in a waiting room with snacks available and a TV on. The other half sit in a waiting room with snacks available and no TV, just magazines. Researchers determine whether people consume more snacks in the TV setting.
This is an experiment, because the researchers take control of the explanatory variable of interest (TV on or not) by assigning each individual to either watch TV or not, and determine the effect that has on the response variable of interest (snack consumption).
Recruit participants for a study. Give them journals to record hour by hour their activities the following day, including when they watch TV and when they consume snacks. Determine if snack consumption is higher during TV times.
This is an observational study, because the participants themselves determine whether or not to watch TV. There is no attempt on the researchers' part to interfere.
Recruit participants for a study. Ask them to recall, for each hour of the previous day, whether they were watching TV, and what snacks they consumed each hour. Determine whether snack consumption was higher during the TV times.
This is also an observational study; again, it was the participants themselves who decided whether or not to watch TV. Do you see the difference between 2 and 3? See the comment below.
Poll a sample of individuals with the following question: While watching TV, do you tend to snack: (a) less than usual; (b) more than usual; or (c) the same amount as usual?
This is a sample survey, because the individuals self-assess the relationship between TV watching and snacking.
Comment : Notice that in Example 2, the values of the variables of interest (TV watching and snack consumption) are recorded forward in time. Such observational studies are called prospective. In contrast, in Example 3, the values of the variables of interest are recorded backward in time. This is called a retrospective observational study.
Indeed, a study in which individuals report variable values themselves (frequently their opinions), is a survey.
a retrospective observational study involves recording variables' values that naturally happened in the past.
a prospective observational study records the values of variables (in this case, baby's growth) as they naturally happen forward in time.
this is an experiment, since a treatment (Botox/salt water) was imposed on the individuals
Example : Every day, a huge number of people are engaged in a struggle whose outcome could literally affect the length and quality of their life: they are trying to quit smoking. Just the array of techniques, products, and promises available shows that quitting is not easy, nor is its success guaranteed. Researchers would like to determine which of the following is the best method:
Drugs that alleviate nicotine addiction.
Therapy that trains smokers to quit.
A combination of drugs and therapy.
Neither form of intervention (quitting "cold turkey").
The explanatory variable is the method (1, 2, 3 or 4) , while the response variable is eventual success or failure in quitting. In an observational study, values of the explanatory variable occur naturally. In this case, this means that the participants themselves choose a method of trying to quit smoking. In an experiment, researchers assign the values of the explanatory variable. In other words, they tell people what method to use. Let us consider how we might compare the four techniques, via either an observational study or an experiment.
An observational study of the relationship between these two variables requires us to collect a representative sample from the population of smokers who are beginning to try to quit. We can imagine that a substantial proportion of that population is trying one of the four above methods. In order to obtain a representative sample, we might use a nationwide telephone survey to identify 1,000 smokers who are just beginning to quit smoking. We record which of the four methods the smokers use. One year later, we contact the same 1,000 individuals and determine whether they succeeded.
In an experiment, we again collect a representative sample from the population of smokers who are just now trying to quit, using a nationwide telephone survey of 1,000 individuals. This time, however, we divide the sample into 4 groups of 250 and assign each group to use one of the four methods to quit. One year later, we contact the same 1,000 individuals and determine whose attempts succeeded while using our designated method.
The following figures illustrate the two study designs:
Observational study : A visual representation of the Observational Study. A large circle represents the entire population. Through random selection we generate the sample, which is represented as a smaller circle. The circle representing the samples is divided up unevenly into 4 pieces, each piece representing one value of the explanatory variable (method), which have been "Self- Assigned" by the people in the sample.
Experiment : A visual representation of the Experimental Study. A large circle represents the entire population. Through random selection we generate the sample, which is represented as a smaller circle. The circle representing the samples is divided up evenly into 4 pieces, each piece representing one value of the explanatory variable (method), which have been assigned by the researchers.
Both the observational study and the experiment begin with a random sample from the population of smokers just now beginning to quit. In both cases, the individuals in the sample can be divided into categories based on the values of the explanatory variable: method used to quit. The response variable is success or failure after one year. Finally, in both cases, we would assess the relationship between the variables by comparing the proportions of success of the individuals using each method, using a two-way table and conditional percentages.
The only difference between the two methods is the way the sample is divided into categories for the explanatory variable (method). In the observational study, individuals are divided based upon the method by which they choose to quit smoking. The researcher does not assign the values of the explanatory variable, but rather records them as they naturally occur. In the experiment, the researcher deliberately assigns one of the four methods to each individual in the sample. The researcher intervenes by controlling the explanatory variable, and then assesses its relationship with the response variable.
We assigned each subject to a weight loss program.
We would not be able to assign each student to a university.
The company would be able to decide which trucks will have their tires rotated and which ones will not
Suppose the observational study described above were carried out, and researchers determined that the percentage succeeding with the combination drug/therapy method was highest, while the percentage succeeding with neither therapy nor drugs was lowest.
In other words, suppose there is clear evidence of an association between method used and success rate. Could they then conclude that the combination drug/therapy method causes success more than using neither therapy nor a drug?
A visual representation of the relationship between the Exploratory Variable and Response Variable. The Exploratory Variable is the Method being used, and it might or might not be related to the Response Variable, which is either Success or Failure.
It is at precisely this point that we confront the underlying weakness of most observational studies : some members of the sample have opted for certain values of the explanatory variable (method of quitting), while others have opted for other values.
It could be that those individuals may be different in additional ways that would also play a role in the response of interest. For instance, suppose women are more likely to choose certain methods to quit, and suppose women in general tend to quit more successfully than men.
The data would make it appear that the method itself were responsible for success, whereas in truth it may just be that being female is the reason for success. We can express this scenario in terms of the key variables involved. In addition to the explanatory variable (method) and the response variable (success or failure), a third, lurking variable (gender) is tied in (or confounded) with the explanatory variable's values, and may itself cause the response to be success or failure. The following diagram illustrates this situation.
Here, the Exploratory Variable, which is the Method, may affect the Response Variable, which is Success or Failure. We also have a Lurking Variable, which is Gender. It is confounded with the Exploratory Variable, so it may also be affecting the Response Variable.
Since the difficulty arises because of the lurking variable's values being tied in with those of the explanatory variable, one way to attempt to unravel the true nature of the relationship between explanatory and response variables is to separate out the effects of the lurking variable. In general, we control for the effects of a lurking variable by separately studying groups that are similar with respect to this variable.
We could control for the lurking variable "gender" by studying women and men separately. Then, if both women and men who chose one method have higher success rates than those opting for another method, we would be closer to producing evidence of causation.
Now, we have two separate studies. The study for Women is represented using one circle representing the Method, connected to another circle with an arrow. The other circle represents Success or Failure. We have the same diagram for Men, showing that Women and Men are now separated into totally different studies.
The diagram above demonstrates how straightforward it is to control for the lurking variable gender.
Notice that we did not claim that controlling for gender would allow us to make a definite claim of causation, only that we would be closer to establishing a causal connection. This is due to the fact that other lurking variables may also be involved, such as the level of the participants' desire to quit. Specifically, those who have chosen to use the drug/therapy method may already be the ones who are most determined to succeed, while those who have chosen to quit without investing in drugs or therapy may, from the outset, be less committed to quitting. The following diagram illustrates this scenario.
The Exploratory Variable, which is the Method, may affect the Response Variable, which is Success or Failure. We also have a Lurking Variable, which is Desire to Quit. It is confounded with the Exploratory Variable because those with more desire to quit may use drugs and/or therapy. So, the Lurking Variable may also be affecting the Response Variable.
To attempt to control for this lurking variable, we could interview the individuals at the outset in order to rate their desire to quit on a scale of 1 (weakest) to 5 (strongest), and study the relationship between method and success separately for each of the five groups. But desire to quit is obviously a very subjective thing, difficult to assign a specific number to. Realistically, we may be unable to effectively control for the lurking variable "desire to quit."
Furthermore, who's to say that gender and/or desire to quit are the only lurking variables involved? There may be other subtle differences among individuals who choose one of the four various methods that researchers fail to imagine as they attempt to control for possible lurking variables. For example, smokers who opt to quit using neither therapy nor drugs may tend to be in a lower income bracket than those who opt for (and can afford) drugs and/or therapy. Perhaps smokers in a lower income bracket also tend to be less successful in quitting because more of their family members and co-workers smoke. Thus, socioeconomic status is yet another possible lurking variable in the relationship between cessation method and success rate.
It is because of the existence of a virtually unlimited number of potential lurking variables that we can never be 100% certain of a claim of causation based on an observational study. On the other hand, observational studies are an extremely common tool used by researchers to attempt to draw conclusions about causal connections. If great care is taken to control for the most likely lurking variables (and to avoid other pitfalls which we will discuss presently), and if common sense indicates that there is good reason for one variable to cause changes in the other, then researchers may assert that an observational study provides good evidence of causation. The key to establishing causation is to rule out the possibility of any lurking variable, or in other words, to ensure that individuals differ only with respect to the values of the explanatory variable. In general, this is a goal which we have a much better chance of accomplishing by carrying out a well-designed experiment.