 

R: Sum with multiple criteria for each row. Processing time is huge

Tags: r, data.table

I am trying to compute, for each row, a sum over the column CountB by "filtering" the data.table DTX with multiple criteria:

  • same Location1,
  • same Location2, and
  • only rows with CountA strictly less than the CountA of that particular row.

library(data.table)

set.seed(1)
norows <- 1e6
DTX <- data.table(
  Location1 = sample(LETTERS, norows, replace = TRUE),
  Location2 = sample(paste0(LETTERS, 1:26), norows, replace = TRUE),
  CountA = sample(100, size = norows, replace = TRUE),
  CountB = sample(100, size = norows, replace = TRUE)
)

# Self-join DTX to itself on Location1/Location2; for each row (i.), sum the
# CountB of matched rows (x.) whose CountA is strictly smaller than this row's.
DTX[, sumifs_result := DTX[.SD,
                           on = .(Location1, Location2),
                           by = .EACHI,
                           sum(CountB[x.CountA < i.CountA])
                           ]$V1][]

The approach works as expected, but it is slow for large data, so I am looking for a more efficient one.

I do not want to sort the table; the original row order must be preserved. I would also prefer to use only the data.table package.

asked Oct 22 '25 by Mohit

2 Answers

The criteria boil down to a conditional cumulative sum, so it is much faster and more memory-efficient to sort the data, perform the computation, and restore the original order than to use the join-and-group approach you are attempting.
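
To see why, take a single (Location1, Location2) group sorted by CountA: the cumulative sum of CountB, shifted down one position and indexed by the first occurrence of each CountA value, is exactly the sum of CountB over strictly smaller CountA. A toy sketch (illustration only, not part of the solution below):

library(data.table)

# One group, already sorted by CountA (illustration only)
toy <- data.table(CountA = c(1, 2, 2, 5), CountB = c(10, 20, 30, 40))

# Shifted cumulative sum: total CountB of all rows *before* each position
shifted <- c(0, cumsum(toy$CountB)[-nrow(toy)])        # 0 10 30 60

# Ties share the value of their first occurrence, so the comparison stays strict
toy[, sumifs_result := shifted[match(CountA, CountA)]]
toy$sumifs_result
# [1]  0 10 10 60   (the CountA = 5 row sums 10 + 20 + 30)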

library(data.table)
library(collapse)

DTX[, idx := .I                       # remember the original row order
    ][order(CountA),                  # process rows in ascending CountA within each group
      sumifs_result := c(0, fcumsum(CountB)[-.N])[fmatch(CountA, CountA)],
      by = .(Location1, Location2)
      ][order(idx)                    # return the rows in their original order
        ][, idx := NULL]

This is about 750x faster on your example data than the original solution. Note that you can replace fmatch()/fcumsum() with match()/cumsum() to drop the collapse dependency, but the collapse functions make the solution about twice as fast as the base versions.
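
For reference, a sketch of the dependency-free variant with that substitution (identical to the chain above except that match()/cumsum() replace fmatch()/fcumsum()):

# Base-R variant of the same conditional cumsum; per the note above, expect
# roughly half the speed of the collapse-based version.
DTX[, idx := .I
    ][order(CountA),
      sumifs_result := c(0, cumsum(CountB)[-.N])[match(CountA, CountA)],
      by = .(Location1, Location2)
      ][order(idx)
        ][, idx := NULL]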

answered Oct 24 '25 by Ritchie Sacramento


Another data.table approach, optimised for the example data: build a lookup table of shifted group-wise cumulative sums and join it back onto the original table.

foo <- function(dt) {
  # Lookup table: total CountB for each (Location1, Location2, CountA),
  # cumulated within each location pair and shifted by one so that each
  # CountA value maps to the sum of CountB over strictly smaller CountA.
  LUT <- dt[, .(v = sum(CountB)), keyby = .(Location1, Location2, CountA)
            ][, v := cumsum(v), by = .(Location1, Location2)
              ][, v := shift(v, fill = 0), by = .(Location1, Location2)]

  # Join the lookup values back onto the original rows (row order is preserved).
  dt[, sumifs_result := LUT[.SD, on = .(Location1, Location2, CountA), v]]
}

Benchmarking against the fastest solution thus far.
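
The jblood() helper is not defined in this answer; presumably it wraps the conditional-cumsum solution from the previous answer. A minimal stand-in so the benchmark can be reproduced (the function name and wrapper are assumptions):

library(collapse)

# Assumed wrapper around the previous answer's approach (illustrative only)
jblood <- function(dt) {
  dt[, idx := .I]                     # remember the original row order
  dt[order(CountA),
     sumifs_result := c(0, fcumsum(CountB)[-.N])[fmatch(CountA, CountA)],
     by = .(Location1, Location2)]
  setorder(dt, idx)                   # restore the original order
  dt[, idx := NULL][]
}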

identical(jblood(dt), foo(dt))
# [1] TRUE
microbenchmark::microbenchmark(
  sbaldur = foo(dt),
  jblood = jblood(dt),
  setup = {dt <- copy(DTX)},
  unit = "relative",
  times = 25
)
# Unit: relative
#     expr      min       lq     mean   median       uq      max neval
#  sbaldur 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    25
#   jblood 2.294585 2.080021 1.905392 1.971365 1.783897 1.628229    25

answered Oct 24 '25 by sindri_baldur



