Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorizing loops in R

Tags:

r

dplyr

purrr

I came across a problem on r-bloggers that I thought I'd try out as a fun project. However, I'm seeing that the loop used in the purrr::map() function is a bottleneck in my code so I was wondering if anyone has any ideas on how I can vectorize the calculation of scores in the below code and avoid the loop?

https://www.r-bloggers.com/2024/04/joint-fiddlin/

The gist of the problem is that Alice and Bob gets points based on pairs in a sequence of coin tosses. Alice gets a point if there are two sequential heads while Bob gets a point if heads is followed by tails. So in a sequence of (H, H, T, H, H), Alice gets 2 points while Bob gets 1 point.

I'm using these libs other than base library(dplyr) # bind_rows library(purrr) # map

I'm using this function to create one game

fn_is_heads <- function(count, sample, sample_size){
  ret <- sample(sample, size = sample_size, replace = TRUE, prob = c(0.5, 0.5))
  return(ret)
}

Then I'm using this to create a list of games (in this case 100,000 games of 100 tosses)

trials <- 1e5
flip_nbr <- 100

flips <- map(1:trials, .f = fn_is_heads, sample = c(TRUE, FALSE), sample_size = flip_nbr)

I wrote this function to calculate the scores per game

fn_score_vec <- function(flips){
  # alice gets points -> TRUE, TRUE
  # bob gets points -> TRUE, FALSE
  
  alice_pts <- sum(flips[-length(flips)] == TRUE & flips[-1] == TRUE)
  bob_pts <- sum(flips[-length(flips)] == TRUE & flips[-1] == FALSE)
  
  return(data.frame(alice = alice_pts, bob = bob_pts))
}

..and I use purrr::map() to iterate over the list of games and return the scores per game

score <- bind_rows(map(.x = flips, .f = fn_score_vec))

Calculating the scores takes about 11s on my pc. Does anyone have any suggestions on how to get rid of purrr::map() and speed up the calculation of the scores (other than calculating in parallel)?

like image 485
eirn Avatar asked Dec 21 '25 06:12

eirn


2 Answers

With a fair coin using base R:

trials <- 1e5
flip_nbr <- 100

system.time({
  flips <- matrix(sample(!(0:1), trials*flip_nbr, 1), trials)
  score <- data.frame(
    alice = rowSums(flips[,-flip_nbr] & flips[,-1]),
    bob = rowSums(flips[,-flip_nbr] & !flips[,-1])
  )
})
#>    user  system elapsed 
#>    0.67    0.08    0.75
like image 57
jblood94 Avatar answered Dec 23 '25 21:12

jblood94


Using runif and matrixStats::rowSums2.

flips <- matrix(runif(trials*flip_nbr) > .5, trials)
data.frame(
  alice=matrixStats::rowSums2(flips[, -flip_nbr] & flips[, -1]),
  bob=matrixStats::rowSums2(flips[, -flip_nbr] & !flips[, -1])
)

Using Rcpp.

Rcpp::cppFunction('
DataFrame calculateSums(int trials, int flip_nbr) {
  NumericVector rand = Rcpp::runif(trials*flip_nbr);
  LogicalMatrix flips(trials, flip_nbr);
  for (int i = 0; i < trials; ++i) {
    for (int j = 0; j < flip_nbr; ++j) {
      flips(i, j) = (rand[i*flip_nbr + j] > 0.5);
    }
  }
  IntegerVector alice(trials), bob(trials);
  for (int i = 0; i < trials; ++i) {
    for (int j = 0; j < flip_nbr - 1; ++j) {
      if (flips(i, j) && flips(i, j + 1)) {
        alice[i]++;
      }
      if (flips(i, j) && !flips(i, j + 1)) {
        bob[i]++;
      }
    }
  }
  return DataFrame::create(_["alice"] = alice, _["bob"] = bob);
}
')

calculateSums(trials, flip_nbr)

Benchmark

$ Rscript --vanilla foo.R
Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
 sample 1234.5062 1253.5345 1276.6813 1275.4404 1291.8341 1343.2751    10 a  
  runif  819.1189  833.8159  871.1204  854.9161  876.7099  991.1695    10  b 
   rcpp  255.5373  266.1039  288.9363  284.2602  303.4640  338.6769    10   c
like image 27
jay.sf Avatar answered Dec 23 '25 23:12

jay.sf



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!