I was under the impression that data.table is extremely well optimized, so I was quite surprised to see this:
library(data.table)
SimData <- data.table(ID = sample(1:4e5, 4e6, replace = TRUE),
                      DATE = sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
                                    4e6, replace = TRUE))
microbenchmark::microbenchmark(SimData[ID==1&DATE>="2005-01-01"])
microbenchmark::microbenchmark(SimData[ID==1][DATE>="2005-01-01"])
The two solutions are quite obviously equivalent, yet there is more than an order of magnitude difference in runtime. Is it possible that data.table performs so poorly with the first form? (I.e., that it can't automatically optimize this call.) Or am I overlooking something here?
The slow operation is SimData[DATE >= "2005-01-01"], because it returns millions of rows:
microbenchmark::microbenchmark(SimData[DATE>="2005-01-01"],SimData[ID==1])
Unit: microseconds
                          expr     min       lq     mean   median       uq      max neval
 SimData[DATE >= "2005-01-01"] 32542.8 44549.55 51323.53 47529.75 50258.10 117396.3   100
              SimData[ID == 1]   820.0  1043.55  1397.79  1435.15  1688.25   2302.5   100
SimData[ID == 1] is much faster, as it returns only a few rows.
When you execute SimData[ID==1 & DATE>="2005-01-01"], you force both conditions to be evaluated on all rows before any row is dropped.
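Conceptually, the single-bracket form behaves like the sketch below (this illustrates the cost, it is not data.table's actual internals):
# Each comparison produces a logical vector over all 4e6 rows, and the
# two vectors are combined with & before any filtering happens.
i1 <- SimData$ID == 1                        # 4e6 comparisons
i2 <- SimData$DATE >= as.Date("2005-01-01")  # another 4e6 comparisons
res <- SimData[i1 & i2]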
With SimData[ID==1][DATE>="2005-01-01"], the quick operation is done first, and the subsequent filter is also quick because it is applied to only a few rows.
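You can confirm that the order of the chained filters is what matters by benchmarking both orderings (exact timings will vary by machine):
# Putting the unselective DATE filter first keeps millions of rows
# alive for the second subset, so it is far slower.
microbenchmark::microbenchmark(
  selective_first   = SimData[ID == 1][DATE >= "2005-01-01"],
  unselective_first = SimData[DATE >= "2005-01-01"][ID == 1]
)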
As mentioned by @jangorecki, there is room for improvement in this area.
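In the meantime, one workaround (an optional optimization, not something benchmarked above) is to key or index the table on ID, so that the selective lookup uses binary search instead of a full vector scan:
# setkey physically sorts the table by ID and enables binary search;
# .(1) performs a keyed join returning the rows where ID == 1.
setkey(SimData, ID)
SimData[.(1)][DATE >= as.Date("2005-01-01")]

# A secondary index achieves a similar speedup without reordering rows.
setindex(SimData, ID)
SimData[ID == 1][DATE >= as.Date("2005-01-01")]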