Suppose we have the following situation: 100 students each flip the same coin many times. A student's first flip is fair, but after that the coin tends to repeat its previous result (in the simulation below, with probability 0.6).
Here is some R code to represent this problem:
library(tidyverse)
set.seed(123)

# 100 students, 100,000 flips in total
ids <- 1:100
student_id <- sort(sample(ids, 100000, replace = TRUE))

coin_result <- character(length(student_id))  # one result per flip
coin_result[1] <- sample(c("H", "T"), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i - 1]) {
    # first flip for a new student: fair coin
    coin_result[i] <- sample(c("H", "T"), 1)
  } else if (coin_result[i - 1] == "H") {
    # previous flip was H: repeat H with probability 0.6
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
  } else {
    # previous flip was T: repeat T with probability 0.6
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
  }
}

my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id), ]

final <- my_data %>%
  group_by(student_id) %>%
  mutate(flip_number = row_number())
The data looks something like this:
# A tibble: 6 x 3
# Groups:   student_id [1]
  student_id coin_result flip_number
       <int> <chr>             <int>
1          1 H                     1
2          1 H                     2
3          1 H                     3
4          1 H                     4
5          1 T                     5
6          1 H                     6
My Problem: In this scenario, suppose that I have no prior knowledge about the coin (i.e. I only have access to the students' data) and I think it's possible that the coin has "correlated probabilities"; in particular, I think the result of the previous flip might influence the next flip. To test this hypothesis, I can perform the following analysis:
1. Randomly sample students with replacement until you have the same number of students as the original data.
2. For each selected student, randomly choose a starting point x and an ending point y (where y > x), and select all available data between x and y for that student.
3. Calculate the probabilities and 95% confidence intervals (a sketch of the interval calculation appears after the code below).
4. Repeat this process k times.
Here is my attempt to code the above procedure:
library(dplyr)
set.seed(123)

n_boot <- 1000
boot_results2 <- matrix(NA, nrow = n_boot, ncol = 4)
colnames(boot_results2) <- c("P(H|H)", "P(T|H)", "P(H|T)", "P(T|T)")

for (b in 1:n_boot) {
  print(b)  # progress

  # Step 1: resample students with replacement (same count as the original data)
  boot_students <- sample(unique(final$student_id), replace = TRUE)

  boot_data <- data.frame(student_id = integer(0), coin_result = character(0),
                          stringsAsFactors = FALSE)

  # Step 2: for each student, keep a random window of flips x:y
  for (s in boot_students) {
    student_data <- final %>% filter(student_id == s)
    x <- sample(nrow(student_data), 1)
    y <- sample(x:nrow(student_data), 1)
    student_data <- student_data[x:y, ]
    boot_data <- rbind(boot_data, student_data)
  }

  # Step 3: proportions of consecutive (previous, next) pairs
  p_hh <- mean(boot_data$coin_result[-1] == "H" & boot_data$coin_result[-nrow(boot_data)] == "H")
  p_th <- mean(boot_data$coin_result[-1] == "H" & boot_data$coin_result[-nrow(boot_data)] == "T")
  p_ht <- mean(boot_data$coin_result[-1] == "T" & boot_data$coin_result[-nrow(boot_data)] == "H")
  p_tt <- mean(boot_data$coin_result[-1] == "T" & boot_data$coin_result[-nrow(boot_data)] == "T")
  boot_results2[b, ] <- c(p_hh, p_th, p_ht, p_tt)
}
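Step 3 mentions 95% confidence intervals, which the loop above never actually computes. A minimal sketch, assuming percentile bootstrap intervals over the n_boot replicates are what is wanted:

# 2.5% and 97.5% quantiles of each column of boot_results2
apply(boot_results2, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)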
My Question: The code seems to run, but it takes a very long time, and I am also not sure whether I have written it correctly.
Can someone please show me how to do this correctly?
Thanks!
Note: Optional Code to Visualize Results:
library(ggplot2)

boot_results_long2 <- as.data.frame(boot_results2)
boot_results_long2$iteration <- 1:n_boot
boot_results_long2 <- boot_results_long2 %>%
  gather(key = "coin", value = "probability", -iteration)

ggplot(boot_results_long2, aes(x = iteration, y = probability, color = coin)) +
  geom_line() +
  labs(x = "Iteration", y = "Probability", color = "Coin") +
  scale_color_discrete(labels = c("P(H|H)", "P(T|H)", "P(H|T)", "P(T|T)"))
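Side note: gather() is superseded in current tidyr; the same reshape can be written with pivot_longer() (assuming tidyr >= 1.0, which library(tidyverse) loads):

# equivalent reshape with pivot_longer() instead of the superseded gather()
boot_results_long2 <- as.data.frame(boot_results2) %>%
  mutate(iteration = 1:n_boot) %>%
  pivot_longer(-iteration, names_to = "coin", values_to = "probability")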
It looks like the main bottleneck is the inner loop. We can make that inner loop about 20x as fast by replacing:
tictoc::tic()
for (s in boot_students) {
  student_data <- final %>% filter(student_id == s)
  x <- sample(nrow(student_data), 1)
  y <- sample(x:nrow(student_data), 1)
  student_data <- student_data[x:y, ]
  boot_data <- rbind(boot_data, student_data)
}
tictoc::toc()
# around 2.5s on my machine
with
tictoc::tic()
boot_data <- final %>%
  left_join(
    # draw one random window (x, y) per student, all at once
    final %>%
      ungroup() %>%
      summarize(n = n(), .by = student_id) %>%
      rowwise() %>%
      mutate(x = sample(1:n, 1),
             y = sample(x:n, 1))
  ) %>%
  filter(flip_number >= x, flip_number <= y)
tictoc::toc()
# around 0.1s on my machine
Your original loop includes a few inefficient steps:

1. Filtering the full data set separately for every sampled student (filter(student_id == s)).
2. Drawing the random window endpoints x and y for one student at a time.
3. Subsetting that student's rows to x:y.
4. Growing boot_data one student at a time with rbind().

It would be more efficient to do (2) for all the students at once, then (3), skipping 1+4. 4 is particularly costly; see chapter 2 ("Growing Objects") of the R Inferno: https://www.burns-stat.com/pages/Tutor/R_inferno.pdf
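To make the "Growing Objects" point concrete, here is a hedged sketch (not the join above, just an illustration) of how the original inner loop could avoid step 4 by collecting each student's slice in a pre-sized list and binding once at the end:

# Illustration only: avoid growing boot_data inside the loop.
pieces <- vector("list", length(boot_students))
for (i in seq_along(boot_students)) {
  student_data <- final %>% filter(student_id == boot_students[i])
  x <- sample(nrow(student_data), 1)
  y <- sample(x:nrow(student_data), 1)
  pieces[[i]] <- student_data[x:y, ]   # store the slice; no rbind() yet
}
boot_data <- bind_rows(pieces)         # one allocation at the end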
I'm confident this could be accelerated much further, but perhaps this gets into the region of "fast enough for now."
If we want to sample each student_id without any weighting, we can take an approach similar to @JonSpring's brilliant answer by grouping instead of joining. On my computer, the data.table approach is about 3 times faster than the dplyr method below.
# keep a random window of rows x:y from one student's data
my_sample = function(data) {
  x = sample(nrow(data), 1L)
  y = sample(x:nrow(data), 1L)
  return(data[x:y, ])
}
## dplyr
final %>%
  group_by(student_id) %>%
  group_modify(~ my_sample(.x))

## data.table
library(data.table)
finalDT = as.data.table(ungroup(final))
finalDT[, my_sample(.SD), student_id]
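To check the claimed speed difference on your own machine, a rough timing sketch using tictoc (as in @JonSpring's answer; invisible() only suppresses printing):

tictoc::tic("dplyr group_modify")
invisible(final %>% group_by(student_id) %>% group_modify(~ my_sample(.x)))
tictoc::toc()

tictoc::tic("data.table by student_id")
invisible(finalDT[, my_sample(.SD), student_id])
tictoc::toc()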
If we instead want to do something similar to boot_students = sample(unique(final$student_id), replace = TRUE) and loop through the resampled students, the data.table solution should remain relatively performant: we can set a key up front and then filter for each student inside the loop.
setkey(finalDT, student_id)
boot_students = sample(unique(final$student_id), replace = TRUE)

lapply(boot_students,
       function(student) finalDT[student_id == student, my_sample(.SD)]) |>
  rbindlist()
For me, this is about 20x faster than the OP's version and produces results identical to the original inner loop. It could be made more performant still by dropping to Rcpp, as there are now methods to subset a data.table from Rcpp that should be more efficient. But... it's probably fast enough ;).
As an aside, we can probably compute the final probabilities more efficiently using table() and possibly prop.table().
## from above
boot_data = lapply(boot_students,
                   function(student) finalDT[student_id == student, my_sample(.SD)]) |>
  rbindlist()

## table() and prop.table() on separate lines so intermediate results can be seen
table(head(boot_data$coin_result, -1), tail(boot_data$coin_result, -1)) |>
  prop.table()
## H T
## H 0.2820237 0.2035307
## T 0.2035307 0.3109150
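One caveat worth flagging: prop.table() with no margin argument returns joint proportions of (previous, next) pairs, so the four cells sum to 1. If conditional probabilities such as P(H|H) are the goal, one option is to normalize within each row (the previous flip), so that each row sums to 1:

table(head(boot_data$coin_result, -1), tail(boot_data$coin_result, -1)) |>
  prop.table(margin = 1)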
To graph, we could simply replace the inner loop of the OP's code with one of the examples above. Alternatively, here is one way to graph using data.table and ggplot. Instead of geom_line(), which might not be the best for discrete data, this uses geom_boxplot(), though more thought could be put into how to visualize it.
library(data.table)
library(ggplot2)

## initialize the data.table
finalDT = as.data.table(ungroup(final))
setkey(finalDT, student_id)

## helper functions to subset a random window and build the prop table
my_sample = function(data) {
  x = sample(nrow(data), 1L)
  y = sample(x:nrow(data), 1L)
  return(data[x:y, ])
}

create_prop_table = function(x) {
  table(head(x, -1L), tail(x, -1L)) |>
    prop.table()
}

## subset up front for a small optimization
unique_students = unique(finalDT[["student_id"]])

## lapply is just a loop
lapply(seq_len(n_boot),
       function(iteration) {
         boot_students = sample(unique_students, replace = TRUE)
         finalDT[.(boot_students), my_sample(.SD), by = .EACHI
                 ][, as.data.table(create_prop_table(coin_result))
                 ][, c("iteration", "coin") := .(iteration, paste0("P(", V1, "|", V2, ")"))]
       }) |>
  rbindlist() |>  ## combine the n_boot simulations into one data.table
  ggplot(aes(x = coin, y = N, color = coin)) +
  geom_boxplot() +
  labs(x = "Coin", y = "Probability", color = "Coin") +
  scale_color_discrete(labels = c("P(H|H)", "P(T|H)", "P(H|T)", "P(T|T)")) +
  ggtitle(paste0("Randomized trials based on ", n_boot, " bootstrap simulations"))
