Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Writing a Random Sampling Procedure for Coin Flips

Tags:

r

Suppose we have the following situation:

  • There is a coin where if it lands head then the probability of the next flip being heads is 0.6 (and if tails then the next flip being tails is also 0.6)
  • There are 100 students in a class
  • Each student flips this coin a random number of times
  • The last flip of student_n does not influence the first flip of student_n+1 (i.e. when the next student flips the coin, the first flip has 0.5 probability of heads or tails, but the next flip for this student depends on the previous flip)

Here is some R code to represent this problem:

library(tidyverse)

set.seed(123)
ids <- 1:100
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] <- sample(c("H", "T"), 1)
  } else if (coin_result[i-1] == "H") {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
  }
}

my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id),]

final <- my_data %>%
    group_by(student_id) %>%
    mutate(flip_number = row_number())
The data looks something like this:

# A tibble: 6 x 3
# Groups:   student_id [1]
  student_id coin_result  flip_number
       <int> <chr>              <int>
1          1 H                      1
2          1 H                      2
3          1 H                      3
4          1 H                      4
5          1 T                      5
6          1 H                      6

My Problem: In this scenario, let's say that I do not have any prior knowledge about this coin (i.e. I only have access to the data from the students) and I think its possible that the coin might have "correlated probabilities" - particularly, I think the result of the previous flip might influence the next flip. To test this hypothesis, I can perform the following analysis:

  • Randomly sample with replacement students until you have the same number of students as the original data.

  • For each of these students selected, randomly choose a starting point x and ending point y (where y>x), and select all available data between x and y for a given student.

  • Then, calculate the probabilities and 95% Confidence Intervals.

  • Repeat this process k times.

Here is my attempt to code the above procedure:

library(dplyr)
set.seed(123)

n_boot <- 1000

boot_results2 <- matrix(NA, nrow = n_boot, ncol = 4)
colnames(boot_results2) <- c("P(H|H)", "P(T|H)", "P(H|T)", "P(T|T)")

for (b in 1:n_boot) {

  print(b)
  

  boot_students <- sample(unique(final$student_id), replace = TRUE)
  

  boot_data <- data.frame(student_id = integer(0), coin_result = character(0), stringsAsFactors = FALSE)
  
  for (s in boot_students) {

    student_data <- final %>% filter(student_id == s)
    

    x <- sample(nrow(student_data), 1)
    y <- sample(x:nrow(student_data), 1)
    

    student_data <- student_data[x:y, ]
    

    boot_data <- rbind(boot_data, student_data)
  }
  

  p_hh <- mean(boot_data$coin_result[-1] == "H" & boot_data$coin_result[-nrow(boot_data)] == "H")
  p_th <- mean(boot_data$coin_result[-1] == "H" & boot_data$coin_result[-nrow(boot_data)] == "T")
  p_ht <- mean(boot_data$coin_result[-1] == "T" & boot_data$coin_result[-nrow(boot_data)] == "H")
  p_tt <- mean(boot_data$coin_result[-1] == "T" & boot_data$coin_result[-nrow(boot_data)] == "T")
  
  boot_results2[b, ] <- c(p_hh, p_th, p_ht, p_tt)
}

My Question: While the code seems to be running - it is taking a very long time to run. I am also not sure if I have written this correctly.

Can someone please show me how to do this correctly?

Thanks!

Note: Optional Code to Visualize Results:

library(ggplot2)

boot_results_long2 <- as.data.frame(boot_results2)
boot_results_long2$iteration <- 1:n_boot
boot_results_long2 <- boot_results_long2 %>%
  gather(key = "coin", value = "probability", -iteration)


ggplot(boot_results_long2, aes(x = iteration, y = probability, color = coin)) +
  geom_line() +
  labs(x = "Iteration", y = "Probability", color = "Coin") +
  scale_color_discrete(labels = c("P(H|H)", "P(T|H)", "P(H|T)", "P(T|T)"))
like image 277
stats_noob Avatar asked Dec 30 '25 15:12

stats_noob


2 Answers

It looks like the main bottleneck is the inner loop. We can make that inner loop about 20x as fast by replacing:

tictoc::tic()
for (s in boot_students) {
  student_data <- final %>% filter(student_id == s)
  x <- sample(nrow(student_data), 1)
  y <- sample(x:nrow(student_data), 1)
  student_data <- student_data[x:y, ]
  boot_data <- rbind(boot_data, student_data)
}
tictoc::toc()
# around 2.5s on my machine

with

tictoc::tic()
boot_data <- final %>%
    left_join(
      final %>%
        ungroup() %>%
        summarize(n = n(), .by = student_id) %>%
        rowwise() %>%
        mutate(x = sample(1:n, 1),
               y = sample(x:n, 1))
    ) %>%
    filter(flip_number >= x, flip_number <= y)
tictoc::toc()
# around 0.1s on my machine

Your original loop includes a few inefficient steps:

  1. create separate subset of final for each of 1000 students (we can skip this)
  2. pick random start (x) and end (y) from x:n for each student. (let's do this once and be vectorized)
  3. subset the data (let's do this once instead of 1000x)
  4. append to the prior students' data (we can skip this if we never separate the data by student to begin with)

It would be more efficient to do (2) for all the students at once, then (3), skipping 1+4. 4 is particularly costly, see chapter 2 ("Growing Objects") of the R Inferno: https://www.burns-stat.com/pages/Tutor/R_inferno.pdf

I'm confident this could be accelerated much further, but perhaps this gets into the region of "fast enough for now."

like image 111
Jon Spring Avatar answered Jan 10 '26 12:01

Jon Spring


If we want to sample each student_id without any weight, we can approach it similar to @JonSpring's brilliant answer by grouping instead of joining. For my computer, the data.table approach is about 3 times faster than the dplyr method below.

my_sample = function(data) {
  x = sample(nrow(data), 1L)
  y = sample(x:nrow(data), 1L)
  return(data[x:y,])
}

## dplyr
final %>%
  group_by(student_id) %>%
  group_modify(~(my_sample(.x)))

## data.table
library(data.table)
finalDT = as.data.table(ungroup(final))
finalDT[, my_sample(.SD), student_id]

If we instead want to do something similar to boot_students = sample(unique(final$student_id), replace = TRUE) and loop through them, the data.table solution should be relatively performant as we can set a key up front and then loop through all of the students to filter.

setkey(finalDT, student_id)
boot_students = sample(unique(final$student_id), replace = TRUE)

lapply(boot_students,
       function (student)  finalDT[student_id == student, my_sample(.SD)]) |>
  rbindlist()

For me, it is about 20x faster than OP and provides identical results as original inner loop. This approach could be made more performant if we dropped to Rcpp as there are methods to subset data.table in Rcpp now which should be more efficient. But... it's probably fast enough ;).

As an aside, we probably can do the end part more efficiently using table and potentially prop.table().

## from above
boot_data = lapply(boot_students,
       function (student)  finalDT[student_id == student, my_sample(.SD)]) %>%
  rbindlist()

## table and prop.table on separate lines so intermediate results can be seen
table(head(boot_data$coin_result, -1), tail(boot_data$coin_result, -1)) |>
  prop.table()

##              H         T
##    H 0.2820237 0.2035307
##    T 0.2035307 0.3109150

To graph, we could simply replace the inner loop of OP with one of the examples above. Alternatively, this could be a way to graph using data.table and ggplot. Instead of geom_line() which might not be the best for discrete data, here is geom_boxplot() but more thought could be put into how to visualize it.

library(data.table)
library(ggplot2)

## initializing data.table 
finalDT = as.data.table(ungroup(final))
setkey(finalDT, student_id)

## helper methods to subset and then transform into prop.table
my_sample = function(data) {
  x = sample(nrow(data), 1L)
  y = sample(x:nrow(data), 1L)
  return(data[x:y,])
}

create_prop_table = function (x) {
  table(head(x, -1L), tail(x, -1L)) |>
    prop.table()
}

## subset up front for small optimization
unique_students = unique(finalDT[["student_id"]])

## lapply is just a loop
lapply(seq_len(n_boot), 
       function(iteration) {
         boot_students = sample(unique_students, replace = TRUE)
         finalDT[.(boot_students), my_sample(.SD), by = .EACHI
                 ][, as.data.table(create_prop_table(coin_result))
                   ][,c("iteration", "coin") := .(iteration, paste0("P(",V1, "|", V2, ")"))]
       }) |> 
  rbindlist() |> ## combine 100 simulations into one data.table
  ggplot(aes(x = coin, y = N, color = coin)) +
  geom_boxplot() +
  labs(x = "Coin", y = "Probability", color = "Coin") +
  scale_color_discrete(labels = c("P(H|H)", "P(T|H)", "P(H|T)", "P(T|T)")) +
  ggtitle(paste0("Randomized trials based on ", n_boot, " bootstrap simulations"))

Boxplot of coin flip simulations

like image 26
Cole Avatar answered Jan 10 '26 13:01

Cole



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!