What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within group (Y) in data.table in R.
Example:
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution which I consider ineffective and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
If you can advise the efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @david-arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With