 

R: Sum with multiple criteria for each row. Processing time is huge

Tags: r, data.table

I am trying to compute, for each row, a sum over the column CountB by "filtering" the data.table DTX with multiple criteria:

  • same Location1,
  • same Location2, and
  • only rows with CountA strictly less than the CountA of that particular row.

library(data.table)

set.seed(1)
norows <- 1e6
DTX <- data.table(
  Location1 = sample(LETTERS, norows, replace = TRUE),
  Location2 = sample(paste0(LETTERS, 1:26), norows, replace = TRUE),
  CountA = sample(100, size = norows, replace = TRUE),
  CountB = sample(100, size = norows, replace = TRUE)
)

# Self-join DTX to itself on Location1/Location2; for each row (i.), sum the
# CountB of matched rows (x.) whose CountA is strictly smaller than this row's.
DTX[, sumifs_result := DTX[.SD,
                           on = .(Location1, Location2),
                           by = .EACHI,
                           sum(CountB[x.CountA < i.CountA])
                           ]$V1][]

The approach works as expected, but it is slow for large data, so I am looking for a more efficient one.

I do not want to sort the table; the original row order must be preserved. I would also prefer to use only the data.table package.

asked Oct 22 '25 by Mohit

2 Answers

The criteria boil down to a conditional cumulative sum, so it is much faster and more memory-efficient to sort the data, perform the computation, and restore the original order than to use the join-and-group approach you are attempting.
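
To see why, take a single (Location1, Location2) group sorted by CountA: the cumulative sum of CountB, shifted down one position and indexed by the first occurrence of each CountA value, is exactly the sum of CountB over strictly smaller CountA. A toy sketch (illustration only, not part of the solution below):

library(data.table)

# One group, already sorted by CountA (illustration only)
toy <- data.table(CountA = c(1, 2, 2, 5), CountB = c(10, 20, 30, 40))

# Shifted cumulative sum: total CountB of all rows *before* each position
shifted <- c(0, cumsum(toy$CountB)[-nrow(toy)])        # 0 10 30 60

# Ties share the value of their first occurrence, so the comparison stays strict
toy[, sumifs_result := shifted[match(CountA, CountA)]]
toy$sumifs_result
# [1]  0 10 10 60   (the CountA = 5 row sums 10 + 20 + 30)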

library(data.table)
library(collapse)

DTX[, idx := .I                       # remember the original row order
    ][order(CountA),                  # process rows in ascending CountA within each group
      sumifs_result := c(0, fcumsum(CountB)[-.N])[fmatch(CountA, CountA)],
      by = .(Location1, Location2)
      ][order(idx)                    # return the rows in their original order
        ][, idx := NULL]

This is about 750x faster on your example data than the original solution. Note that you can replace fmatch()/fcumsum() with match()/cumsum() to drop the collapse dependency, but the collapse functions make the solution about twice as fast as the base versions.
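
For reference, a sketch of the dependency-free variant with that substitution (identical to the chain above except that match()/cumsum() replace fmatch()/fcumsum()):

# Base-R variant of the same conditional cumsum; per the note above, expect
# roughly half the speed of the collapse-based version.
DTX[, idx := .I
    ][order(CountA),
      sumifs_result := c(0, cumsum(CountB)[-.N])[match(CountA, CountA)],
      by = .(Location1, Location2)
      ][order(idx)
        ][, idx := NULL]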

answered Oct 24 '25 by Ritchie Sacramento


Another data.table approach, optimised for the example data: build a lookup table of shifted group-wise cumulative sums and join it back onto the original table.

foo <- function(dt) {
  # Lookup table: total CountB for each (Location1, Location2, CountA),
  # cumulated within each location pair and shifted by one so that each
  # CountA value maps to the sum of CountB over strictly smaller CountA.
  LUT <- dt[, .(v = sum(CountB)), keyby = .(Location1, Location2, CountA)
            ][, v := cumsum(v), by = .(Location1, Location2)
              ][, v := shift(v, fill = 0), by = .(Location1, Location2)]

  # Join the lookup values back onto the original rows (row order is preserved).
  dt[, sumifs_result := LUT[.SD, on = .(Location1, Location2, CountA), v]]
}

Benchmarking against the fastest solution thus far.
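
The jblood() helper is not defined in this answer; presumably it wraps the conditional-cumsum solution from the previous answer. A minimal stand-in so the benchmark can be reproduced (the function name and wrapper are assumptions):

library(collapse)

# Assumed wrapper around the previous answer's approach (illustrative only)
jblood <- function(dt) {
  dt[, idx := .I]                     # remember the original row order
  dt[order(CountA),
     sumifs_result := c(0, fcumsum(CountB)[-.N])[fmatch(CountA, CountA)],
     by = .(Location1, Location2)]
  setorder(dt, idx)                   # restore the original order
  dt[, idx := NULL][]
}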

identical(jblood(dt), foo(dt))
# [1] TRUE
microbenchmark::microbenchmark(
  sbaldur = foo(dt),
  jblood = jblood(dt),
  setup = {dt <- copy(DTX)},
  unit = "relative",
  times = 25
)
# Unit: relative
#     expr      min       lq     mean   median       uq      max neval
#  sbaldur 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    25
#   jblood 2.294585 2.080021 1.905392 1.971365 1.783897 1.628229    25

answered Oct 24 '25 by sindri_baldur



