Click here to return to the main page

I worked next to Lindsey Carlson on this assignment, and we discussed some of the questions together.

Question 1: Using a for loop, write a function to calculate the number of zeroes in a numeric vector. Before entering the loop, set up a counter variable counter <- 0. Inside the loop, add 1 to counter each time you have a zero in the matrix. Finally, use return(counter) for the output.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
##########################
# FUNCTION:zero_count
# purpose: Count the nuber of zeros in a numeric vector
# input:A numeric vector
# output: A single number that is a count of the number of zeros in that vector
# ------------------------

zero_count <- function(vector = sample(-100:100, 1000, replace = TRUE)){
  counter <- 0
for (i in 1:length(vector)) {
  if(vector[i] == 0){ 
    counter <- counter+1
  } # end if statement
} # end forloop
    return(counter)
} # end function

  zero_count()
## [1] 8

Question 2: Use subsetting instead of a loop to rewrite the function as a single line of code.

##########################
# FUNCTION:zero_count_short
# purpose: Count the nuber of zeros in a numeric vector using subsetting
# input:A numeric vector
# output: A single number that is a count of the number of zeros in that vector
# ------------------------
zero_count_simple <- function(vector = sample(-100:100, 1000, replace = TRUE)){
  return(length(vector[vector==0]))
}
zero_count_simple()
## [1] 11

Question 3: Write a function that takes as input two integers representing the number of rows and columns in a matrix. The output is a matrix of these dimensions in which each element is the product of the row number x the column number.

##########################
# FUNCTION:multiply_row_col
# purpose:multiplies the row number and column number by each other and puts these in a matrix of given size
# input: two integers denoting the desired number of rows and columns of a matrix
# output: a matrix with the product of the row number by the column number in each cell
# ------------------------
multiply_row_col <- function(nrow = 3, ncol = 3) {
  mat <- matrix(nrow = 3, ncol = 3)
  for (i in 1:nrow){
    for (j in 1:ncol) {
      mat[i,j] <- i*j
    } # end j forloop
  } # end i forloop
  print(mat)
}# end function
multiply_row_col()
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9

Question 4: In the next few lectures, you will learn how to do a randomization test on your data. We will complete some of the steps today to practice calling custom functions within a for loop. Use the code from the March 31st lecture (Randomization Tests) to complete the following steps:

4a: Simulate a dataset with 3 groups of data, each group drawn from a distribution with a different mean. The final data frame should have 1 column for group and 1 column for the response variable.

set.seed(100)
group <- as.factor(c(rep("A", 20), rep("B", 20), rep("C", 20))) #make group vector
A <- rnorm(n = 20, mean = 10, sd= 2.5) #make corresponding data for group A
B <- rnorm(n = 20, mean = 175, sd = 5)#make corresponding data for group B
C <- rnorm(n = 20, mean = 50, sd = 1)#make corresponding data for group C
resp <- c(A, B, C) #combine data into one vector for the responses
q4_df <- data.frame(group, resp) # combine response vector and groups into a data frame
head(q4_df)
##   group      resp
## 1     A  8.744519
## 2     A 10.328828
## 3     A  9.802707
## 4     A 12.216962
## 5     A 10.292428
## 6     A 10.796575

4b: Write a custom function that 1) reshuffles the response variable, and 2) calculates the mean of each group in the reshuffled data. Store the means in a vector of length 3.

##########################
# FUNCTION:reshuffle
# purpose: reshuffles the response variable of a data frame and then provides the mean for each group
# input: a data frame of two columns, response and group, and the name of the response column as a string
# output: the mean of each reshuffled group
# ------------------------
reshuffle <- function(df = q4_df) {
  colnames(df) <- c("group", "response")
  df$response <- sample(df$response, length(df$response), replace = FALSE)

    df2 <- df %>%
     group_by(group) %>%
     summarize(mean = mean(response))
    return(as.vector(df2$mean))
}
reshuffle()
## [1]  64.77255  67.70974 103.23161

4c: Use a for loop to repeat the function in b 100 times. Store the results in a data frame that has 1 column indicating the replicate number and 1 column for each new group mean, for a total of 4 columns.

#create empty vectors to fill in for loop
rep <- rep(NULL, 100)
grp_meanA <- rep(NULL, 100)
grp_meanB <- rep(NULL, 100)
grp_meanC <- rep(NULL, 100)
df <- data.frame(rep, grp_meanA, grp_meanB, grp_meanC)

#For loop to reshuffle data 100 times and save output as a dataframe
for (i in 1:100) {
  df[i,1] <- i
  for (j in 1:3) {
  df_reshuff <- data.frame(c(A, B, C), reshuffle())
  df[i,j+1] <- df_reshuff[j,2]
  
  } # end j for loop
} #end i for loop
colnames(df) <- c("rep", "grp_meanA", "grp_meanB", "grp_meanC")
head(df)
##   rep grp_meanA grp_meanB grp_meanC
## 1   1  59.51702  86.36297  86.53452
## 2   2  71.49170  78.90783  75.55412
## 3   3  79.88202  88.04411  90.15688
## 4   4  77.37317  71.17513  76.14943
## 5   5  87.45984  83.53235  73.81204
## 6   6  86.32876  73.20259  75.74909

4d: Use qplot() to create a histogram of the means for each reshuffled group. Or, if you want a challenge, use ggplot() to overlay all 3 histograms in the same figure. How do the distributions of reshuffled means compare to the original means?

library(wesanderson)

meanA <- qplot(x = grp_meanA,
               data = df,
               main = "Reshuffled Group A means",
               xlab = "Group A Means",
               ylab = "Count",
               color = I("black"),
               fill = I(wes_palette("Moonrise1", n = 1))
               ) # plot means of 100 reshuffled groupA means
print(meanA)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

meanB <- qplot(x = grp_meanB,
               data = df,
               main = "Reshuffled Group B means",
               xlab = "Group B Means",
               ylab = "Count",
               color = I("black"),
               fill = I(wes_palette("Zissou1", n = 1))
               ) # plot means of 100 reshuffled groupB means
print(meanB)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

meanC <- qplot(x = grp_meanC,
               data = df,
               main = "Reshuffled Group C means",
               xlab = "Group C Means",
               ylab = "Count",
               color = I("black"),
               fill = I(wes_palette("GrandBudapest2", n = 1))
               )# plot means of 100 reshuffled groupC means
print(meanC)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.



Question 5: Use the histograms from above to determine if the original groups would fall in the tails of the reshuffled groups, indicating that the original groups are statistically different from each other

Worked with Lindsey Carlson on the following segment

library(tidyverse)
library(wesanderson)

#First step is to calculate an ANOVA comparing the initial groups to see if they are statistically different.  I will then visualize this using the reshuffled histograms

##########################
# FUNCTION:get_metric
# purpose: Calculate an ANOVA comparing groups
# input: a dataframe of two columns
# output: a p value comparing the groups
# ------------------------
get_metric <- function(df = q4_df) {
  test <- aov(resp ~ group, data = q4_df)
  return(summary(test)[[1]][["Pr(>F)"]][1])
}

x_obs <- get_metric(q4_df)
print(x_obs)
## [1] 2.298408e-84
#  The above shows that my groups are statistically different from each other.  I will use the histograms from part 4 to show this graphically by showing the means from my observed data on the simulated histograms.  Discussed this method with the TAs and got the okay to answer this question this way

#Plot Group A
meanA2 <-ggplot(aes(x = grp_meanA),
               data = df) +
               geom_histogram(color = "black", fill = wes_palette("Moonrise1", n = 1)) + geom_vline(xintercept=mean(A)) + #insert mean of group A line
geom_vline(xintercept = quantile(df$grp_meanC, probs = 0.025), color = "red") + 
  geom_vline(xintercept = quantile(df$grp_meanC, probs = 0.975), color = "red") #insert quantile lines
  
print(meanA2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The red lines on the  histograms represent the 95% confidence interval on the simulated data and the black line is the observed mean.  The fact that the black line is outside of the red lines indicates that it is likely from a different population.

# PLot Group B
meanB2 <-ggplot(aes(x = grp_meanB),
               data = df) +
               geom_histogram(color = "black", fill = wes_palette("Zissou1", n = 1)) + geom_vline(xintercept=mean(B)) + #insert mean of group B line
  geom_vline(xintercept = quantile(df$grp_meanB, probs = 0.025), color = "red") +
  geom_vline(xintercept = quantile(df$grp_meanB, probs = 0.975), color = "red") #insert quantile lines
  
  
print(meanB2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#plot Group C
meanC2 <-ggplot(aes(x = grp_meanC),
               data = df) +
               geom_histogram(color = "black", fill = wes_palette("GrandBudapest2", n = 1)) + geom_vline(xintercept=mean(C)) + #insert mean of group C line
geom_vline(xintercept = quantile(df$grp_meanC, probs = 0.025), color = "red") +
  geom_vline(xintercept = quantile(df$grp_meanC, probs = 0.975), color = "red") #insert quantile lines
  
print(meanC2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.