Click here to return to the main page

Creating Fake Data Sets To Explore Hypotheses

Question 1: Think about an ongoing study in your lab (or a paper you have read in a different class), and decide on a pattern that you might expect in your experiment if a specific hypothesis were true.

For a paper I’m working on, I would expect the density of tree seedlings growing on tipped-over root mounds to increase as mound height increases, due to reduced competition for light from surrounding shrubs and saplings. This involves continuous dependent and independent variables, so I will analyze these data with a form of linear regression.

Question 2: To start simply, assume that the data in each of your treatment groups follow a normal distribution. Specify the sample sizes, means, and variances for each group that would be reasonable if your hypothesis were true. You may need to consult some previous literature and/or an expert in the field to come up with these numbers.

Sample size: 20

Root Mound Height:

Mean: 1.8m

SD: 0.7

Slope = 2.5

intercept = 0

As this is the first study to examine this relationship, and the literature contains no information on how root mound height relates to the number of seedlings on a mound, I based these numbers on the actual mean and SD from the dataset I have, adjusted slightly to make them more usable. I ran a simple lm on my dataset to obtain the slope and intercept.
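The pilot-data step can be sketched like this; `pilot_df` is a hypothetical stand-in for my actual field dataset, not the real numbers:

```r
# Hypothetical pilot data standing in for the real field dataset
pilot_df <- data.frame(
  height = c(0.9, 1.2, 1.6, 2.0, 2.4, 2.9),  # mound height (m)
  density = c(2.1, 3.4, 3.8, 5.3, 6.0, 7.4)  # seedlings per mound
)

pilot_lm <- lm(density ~ height, data = pilot_df) # simple linear model
coef(pilot_lm) # intercept and slope used to seed the simulation
```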

Question 3: Using the methods we have covered in class, write code to create a random data set that has these attributes. Organize these data into a data frame with the appropriate structure.

library(ggplot2)
#Make equation to calculate density 
beta1 <- 2.5 # create slope
height <- rnorm(20, mean = 1.8, sd = 0.7) #create normally distributed x with parameters from above
beta0 <- 0 # intercept of 0

density <- beta1 * height + beta0 # calculate density based on the above variables

df <- data.frame(height, density) # make data frame of these vectors
df$index <- seq_along(height) # add index column
head(df)
##     height  density index
## 1 2.106762 5.266905     1
## 2 2.578706 6.446766     2
## 3 1.949709 4.874273     3
## 4 2.118599 5.296498     4
## 5 2.049782 5.124455     5
## 6 1.897671 4.744178     6

Question 4: Now write code to analyze the data (probably as an ANOVA or regression analysis, but possibly as a logistic regression or contingency table analysis). Write code to generate a useful graph of the data.

lm1 <- lm(density ~ height, data = df) #make linear model
summary(lm1) #show summary of model
## Warning in summary.lm(lm1): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = density ~ height, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.888e-15 -2.692e-16 -4.730e-17  1.067e-16  4.144e-15 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 2.383e-15  1.142e-15 2.087e+00   0.0513 .  
## height      2.500e+00  5.616e-16 4.451e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.216e-15 on 18 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.982e+31 on 1 and 18 DF,  p-value: < 2.2e-16
plot(df$height, df$density, pch = 20, col = "black", xlab = "Root Mound Height", ylab = "Seedling Density", ylim = c(0, 15), xlim = c(0, 5)) #make graph

Question 5: Try running your analysis multiple times to get a feeling for how variable the results are with the same parameters, but different sets of random numbers.

# Run code a second time
#Root Mound Height
beta1 <- 2.5 # create slope
height <- rnorm(20, mean = 1.8, sd = 0.7) #create normally distributed x with parameters from above
beta0 <- 0 # intercept of 0

density <- beta1 * height + beta0 # make density based on these variables

df <- data.frame(height, density) # make data frame of these vectors
df$index <- seq_along(height) # add index column

lm1 <- lm(density ~ height, data = df) #make linear model
summary(lm1) #make summary of model
## Warning in summary.lm(lm1): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = density ~ height, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.564e-16 -3.747e-16 -2.659e-16 -9.723e-17  2.783e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.589e-15  6.913e-16 -2.298e+00   0.0337 *  
## height       2.500e+00  3.650e-16  6.850e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.313e-16 on 18 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.692e+31 on 1 and 18 DF,  p-value: < 2.2e-16
plot(df$height, df$density, pch = 20, col = "black", xlab = "Root Mound Height", ylab = "Seedling Density", ylim = c(0, 15), xlim = c(0, 5)) #make graph

####################
# Run code a third time
#Root Mound Height
beta1 <- 2.5 # create slope
height <- rnorm(20, mean = 1.8, sd = 0.7) #create normally distributed x with parameters from above
beta0 <- 0 # intercept of 0

density <- beta1 * height + beta0 # make density based on these variables

df <- data.frame(height, density) # make data frame of these vectors
df$index <- seq_along(height) # add index column

lm1 <- lm(density ~ height, data = df) #make linear model
summary(lm1) #make summary of model
## Warning in summary.lm(lm1): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = density ~ height, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.243e-15 -1.116e-16 -4.710e-18  1.755e-17  2.499e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -7.944e-16  4.351e-16 -1.826e+00   0.0845 .  
## height       2.500e+00  2.226e-16  1.123e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.974e-16 on 18 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.261e+32 on 1 and 18 DF,  p-value: < 2.2e-16
plot(df$height, df$density, pch = 20, col = "black", xlab = "Root Mound Height", ylab = "Seedling Density", ylim = c(0, 15), xlim = c(0, 5)) #make graph
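Rather than copying the same block three times, the repeated simulation can be wrapped in a function and run in a loop (a sketch using the same parameters as above; `simulate_fit` is my own helper name):

```r
# Sketch: wrap one simulate-and-fit cycle in a function
simulate_fit <- function(n = 20, beta0 = 0, beta1 = 2.5,
                         mean_h = 1.8, sd_h = 0.7) {
  height <- rnorm(n, mean = mean_h, sd = sd_h) # simulate mound heights
  density <- beta1 * height + beta0            # deterministic response (no noise yet)
  lm(density ~ height)                         # refit the model
}

# Slope estimates from five independent simulated data sets
slopes <- replicate(5, coef(simulate_fit())["height"])
slopes # all essentially 2.5, since there is no noise term yet
```

With no noise term the slope is recovered exactly every time, which is why each run above produces the same "essentially perfect fit" warning.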

Question 6/7: Adjusting parameters to change p values

# Run code with an added noise term so that the simulated data are not just a perfect line

#Root Mound Height
beta1 <- 2.5 # create slope
noise <- rnorm(n = 20, mean = 0, sd = 3) # add noise parameter that offsets points from the perfect regression line
height <- rnorm(20, mean = 1.8, sd = 0.7) #create normally distributed x with parameters from above
beta0 <- 0 # intercept of 0

density <- beta1 * height + beta0 + noise # make density based on these variables

df <- data.frame(height, density)

lm1 <- lm(density ~ height, data = df) #make linear model
summary(lm1) #make summary of model
## 
## Call:
## lm(formula = density ~ height, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8343 -1.4191  0.5212  1.5408  4.6754 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.009519   2.245663   2.231   0.0387 *
## height      -0.003378   1.084551  -0.003   0.9975  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.83 on 18 degrees of freedom
## Multiple R-squared:  5.388e-07,  Adjusted R-squared:  -0.05555 
## F-statistic: 9.699e-06 on 1 and 18 DF,  p-value: 0.9975
plot(df$height, df$density, pch = 20, col = "black", xlab = "Root Mound Height", ylab = "Seedling Density", ylim = c(0, 15), xlim = c(0, 5)) #make graph

Since my data were created to fall on a perfect line, the best way to make them resemble real data is to add a noise term, as I did above. I experimented with the standard deviation of this term to see what level of “noise” led to non-significant results. With an SD of 2, the model was usually significant, although not always. Once the SD reached 3, the p value of the model was highly variable, and usually not significant.
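That trial-and-error exploration can be made systematic by sweeping the noise SD and recording the share of significant fits. This is a sketch; the 0.05 threshold and 200 replicates per SD are my own choices, not values from the exercise:

```r
# Sketch: p value of the slope for one simulated data set at a given noise SD
p_for_sd <- function(noise_sd, n = 20, beta0 = 0, beta1 = 2.5) {
  height <- rnorm(n, mean = 1.8, sd = 0.7)
  density <- beta1 * height + beta0 + rnorm(n, mean = 0, sd = noise_sd)
  summary(lm(density ~ height))$coefficients["height", "Pr(>|t|)"]
}

set.seed(1) # reproducible sweep
for (s in c(1, 2, 3, 4)) {
  pvals <- replicate(200, p_for_sd(s))
  cat("noise SD", s, ": significant in",
      round(mean(pvals < 0.05) * 100), "% of runs\n")
}
```

The proportion of significant runs falls as the noise SD grows, which matches the pattern I saw by rerunning the analysis by hand.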