I am trying to repeatedly subset the same data frame by the ids in each element of a list of data frames. I have one data frame of about 50,000 rows and a list of about 1,000 data frames; each data frame in the list has between 100 and 1,000 rows and the same structure.
Consider this example:
set.seed(1)
df1 <- data.frame(id = sample(sample(1000:3000, 1000), 50000, TRUE),
                  info = runif(50000, 200, 300))
l <- replicate(1000,
               data.frame(id = sample(1000:3000, sample(400:700, 1), replace = TRUE)),
               simplify = FALSE)
I want to subset df1 depending on the ids in l. I can do that by performing a semi-join, or by subsetting with %in%:
library(dplyr)
semi_join(df1, l[[1]], "id")
df1[df1$id %in% l[[1]]$id, ]
I'm looking for a fast solution that scales to a list of thousands of data frames. So far, I wrap the operation in lapply (but there might be faster, vectorized solutions).
lapply(l, \(x) semi_join(df1, x, "id"))
Here's a starting benchmark with the solutions:
library(data.table)
bc <- bench::mark(
  dplyr      = lapply(l, \(x) semi_join(df1, x, "id")),
  baseR      = lapply(l, \(x) df1[df1$id %in% x$id, ]),
  unique     = lapply(l, \(x) df1[df1$id %in% unique(x$id), ]),
  data.table = {df2 <- setDT(df1); lapply(l, \(x) df2[df2$id %in% unique(x$id), ])},
  iterations = 10, check = FALSE
)
bc
#> # A tibble: 4 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 dplyr         2.25s    2.43s    0.416     1.64GB    4.03     10    97
#> 2 baseR         9.04s    9.55s    0.105     1.56GB    0.536    10    51
#> 3 unique        10.3s   10.95s    0.0912    1.57GB    0.420    10    46
#> 4 data.table   10.21s    10.9s    0.0916   979.5MB    0.458    10    50
Any package can be used.
A one-shot data.table join on an rbindlist of the whole list will be fast.
library(data.table)
library(dplyr)
microbenchmark::microbenchmark(
  dplyr         = lapply(l, \(x) semi_join(df1, x, "id")),
  baseR         = lapply(l, \(x) df1[df1$id %in% x$id, ]),
  unique        = lapply(l, \(x) df1[df1$id %in% unique(x$id), ]),
  data.table    = {setDT(df2); lapply(l, \(x) df2[id %in% unique(x$id)])},
  bindJoinSplit = split(setDT(df2, key = "id")[unique(rbindlist(l, idcol = "df")),
                                               on = "id", allow.cartesian = TRUE,
                                               nomatch = 0],
                        by = "df", keep.by = FALSE),
  times = 10,
  setup = df2 <- copy(df1)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> dplyr 1555.2198 1569.4590 1630.778 1619.9118 1626.8791 1857.1259 10
#> baseR 1079.4916 1087.2524 1131.043 1126.8084 1179.5400 1196.9150 10
#> unique 1171.3705 1214.4915 1253.490 1234.3908 1274.9278 1398.2908 10
#> data.table 1925.1388 1950.9440 1978.538 1982.5494 1995.3917 2038.3573 10
#> bindJoinSplit 368.8109 380.4029 412.595 401.4089 437.4869 503.8478 10
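For readability, here is the bindJoinSplit expression above unrolled into steps (df2 is a data.table copy of df1, as in the benchmark's setup):
# Stack every lookup table into one long table, tag each row with its source
# element via idcol, do a single keyed join, then split back into a list.
df2 <- as.data.table(df1)
setkey(df2, id)
big <- unique(rbindlist(l, idcol = "df"))        # all distinct (df, id) pairs
res <- df2[big, on = "id", allow.cartesian = TRUE, nomatch = 0]   # one join instead of 1000
out <- split(res, by = "df", keep.by = FALSE)    # back to a list of 1000 data.tables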
If the binding and splitting could be avoided, this would be much faster and easier. From ?data.table::split:
"Be aware that processing list of data.tables will be generally much slower than manipulation in single data.table by group using by argument"
Say the data had been built into a single data.table instead of a list of data.frames. We can also keep the results in a single data.table instead of a list of data.frames. That is faster still.
dt <- rbindlist(l, idcol = "df")
microbenchmark::microbenchmark(
  bindJoinSplit = split(setDT(df2, key = "id")[unique(rbindlist(l, idcol = "df")),
                                               on = "id", allow.cartesian = TRUE,
                                               nomatch = 0],
                        by = "df", keep.by = FALSE),
  join          = setDT(df2, key = "id")[unique(dt), on = "id",
                                         allow.cartesian = TRUE, nomatch = 0],
  times = 10,
  setup = df2 <- copy(df1)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> bindJoinSplit 317.1514 326.3613 343.2588 333.3914 364.1945 398.5467 10
#> join 188.5818 191.4344 198.7380 194.4094 208.7339 219.9847 10
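Downstream work can then also stay in that single table, using by instead of lapply over a list. For instance (the per-element mean of info is only an assumed illustration, not something the question asks for):
df2 <- copy(df1)                                   # as in the benchmark's setup
res <- setDT(df2, key = "id")[unique(dt), on = "id",
                              allow.cartesian = TRUE, nomatch = 0]
res[, .(rows = .N, mean_info = mean(info)), by = df]   # per-element summary (illustrative)
res[df == 1]                                           # rows matched by l[[1]]'s ids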
Data:
set.seed(1)
df1 <- data.frame(id = sample(sample(1000:3000, 1000), 50000, TRUE),
                  info = runif(50000, 200, 300))
l <- replicate(1000,
               data.frame(id = sample(1000:3000, sample(400:700, 1), replace = TRUE)),
               simplify = FALSE)