Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cumulative sum of numeric vectors in list

I hope someone will be able to help me on this problem. I have a list object that includes 48 vectors and each vector has a length of 2,000,000 observations in it. Here is a code that creates the same structure with only 100,000 items per vector:

mtx_sim <- matrix(data = runif(48 * 100000), ncol = 48, nrow = 100000)
mtx_list <- as.list(data.frame(mtx_sim))

I want to cumulatively sum each row of the vectors in the list. However, there is one stipulation I only want to sum the last thirty vectors. For example, the 35th vector in the list should be added to the 34 preceding vectors. On the other hand, the fourth vector in the list should be added to the three preceding vectors (vector number three, two, and one). Here is my code example that relies on the lapply function in combination with rowSums, which is relatively slow:

start <- c(rep(1, times = 30), seq(2, 19, 1))
end <- seq(1,48,1)

system.time(xxx <- lapply(1:48, function(x)
rowSums(
  matrix(
    unlist(mtx_list[start[x]:end[x]]), 
    ncol = (end[x] - start[x] + 1)))
) )

 user  system elapsed 
 62.19    0.56   63.04 

Does anyone have an idea to optimize the code?

like image 283
user1738753 Avatar asked Dec 29 '25 18:12

user1738753


1 Answers

You are doing two expensive things in an otherwise reasonable algorithm:

  1. You are recreating a matrix from your list for every iteration; this is likely slow
  2. You are recomputing the entire row sums repeatedly, when in reality you just need to calculate the marginal changes

Here is an alternative. We reconstitute the original matrix once, and then just add the marginal columns.

fun_brodie <- function(mtx_list) {
  mtx <- do.call(cbind, mtx_list)
  base <- mtx[, 1]
  res <- list(base)
  for(i in seq(ncol(mtx))[-1]) 
    res[[i]] <- res[[i - 1]] + mtx[, i] - if(i > 30) mtx[, i - 30] else 0
  res
}
res <- fun_brodie(mtx_list)

Confirm equal:

all.equal(res, xxx)
# [1] TRUE

Benchmarks:

library(microbenchmark)
microbenchmark(times=3, fun_marat(mtx_list), fun_brodie(mtx_list), fun_op(mtx_list))

Produces:

Unit: milliseconds
                 expr        min        lq       mean
  fun_marat(mtx_list)  1661.9135  1763.418  1800.3530
 fun_brodie(mtx_list)   115.7877   116.061   153.6794
     fun_op(mtx_list) 58059.7803 60388.303 62060.5557

Thanks to Marat for pointing out an interpretation error on my part. Also, note that in order to make fun_marat compartible I added a step of cbinding the list to a data frame.

like image 78
BrodieG Avatar answered Dec 31 '25 09:12

BrodieG



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!