I am trying to do a sum over a vector CountB by "filtering" a DTX with multiple criteria
location1,location2, andCountA strictly less than the CountA in that particular row.library(data.table)
set.seed(1)
norows <- 1e6
DTX <- data.table(
Location1 = sample(LETTERS, norows, replace = TRUE),
Location2 = sample(paste0(LETTERS, 1:26), norows, replace = TRUE),
CountA = sample(100, size = norows, replace = TRUE),
CountB = sample(100, size = norows, replace = TRUE)
)
DTX[, sumifs_result := DTX[.SD,
on = .(Location1, Location2),
by = .EACHI,
sum(CountB[x.CountA < i.CountA])
]$V1][]
The approch works as expected, but is slow for large data. Therefore, I am looking for a more efficient approach.
I do not want to sort; the original order must be preserved. I would prefer to use only the data.table package.
The criteria boil down to a conditional cumsum so it will be much faster and more memory efficient to sort the data, perform the computation and restore to the original order than the join and group approach being attempted.
library(data.table)
library(collapse)
DTX[, idx := .I
][order(CountA), sumifs_result := c(0, fcumsum(CountB)[-.N])[fmatch(CountA, CountA)],
by = .(Location1, Location2)
][order(idx)
][, idx := NULL
]
This is about 750x faster on your example data than the original solution. Note you can replace fmatch()/fcumsum() with match()/cumsum() to remove the collapse dependency but these make the solution about twice as fast compared to the base versions.
Another data.table approach. Optimised for the example data.
foo <- function(dt) {
LUT <- dt[, .(v = sum(CountB)), keyby = .(Location1, Location2, CountA)
][, v := cumsum(v), by = .(Location1, Location2)
][, v := shift(v, fill = 0), by = .(Location1, Location2)]
dt[, sumifs_result := LUT[.SD, on = .(Location1, Location2, CountA), v]]
}
Benchmarking against the fastest solution thus far.
identical(jblood(dt), foo(dt))
# [1] TRUE
microbenchmark::microbenchmark(
sbaldur = foo(dt),
jblood = jblood(dt),
setup = {dt <- copy(DTX)},
unit = "relative",
times = 25
)
# Unit: relative
# expr min lq mean median uq max neval
# sbaldur 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 25
# jblood 2.294585 2.080021 1.905392 1.971365 1.783897 1.628229 25
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With