
R data table: strangely poor performance in subsetting

I was under the impression that data.table is extremely well optimized, so I was quite surprised to see this:

library(data.table)

SimData <- data.table(ID = sample(1:4e5, 4e6, replace = TRUE),
                      DATE = sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
                                    4e6, replace = TRUE))

microbenchmark::microbenchmark(SimData[ID==1&DATE>="2005-01-01"])
microbenchmark::microbenchmark(SimData[ID==1][DATE>="2005-01-01"])

The two solutions are quite obviously the same, yet there is more than an order of magnitude difference in runtime. Is it possible that data.table performs so poorly with the first form? (I.e., that it can't automatically optimize this call.) Or am I overlooking something here?

asked by Tamas Ferenci

1 Answer

The slow operation is SimData[DATE>="2005-01-01"], because it returns millions of rows.

microbenchmark::microbenchmark(SimData[DATE>="2005-01-01"],SimData[ID==1])
Unit: microseconds
                          expr     min       lq     mean   median       uq      max neval
 SimData[DATE >= "2005-01-01"] 32542.8 44549.55 51323.53 47529.75 50258.10 117396.3   100
              SimData[ID == 1]   820.0  1043.55  1397.79  1435.15  1688.25   2302.5   100

SimData[ID == 1] is much faster, as it returns only a few rows.

When you execute SimData[ID==1&DATE>="2005-01-01"], both conditions are evaluated over all 4 million rows before the subset is taken (roughly as sketched below).
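
To make that concrete, here is a rough base-R sketch of the work implied by the compound filter (illustrative only, not data.table's actual internals):

# Both logical vectors span the full table before any rows are dropped
i1  <- SimData$ID == 1                          # 4e6-element logical vector
i2  <- SimData$DATE >= as.Date("2005-01-01")    # another 4e6-element logical vector
res <- SimData[i1 & i2]                         # subset taken only after both are computed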

With SimData[ID==1][DATE>="2005-01-01"], the quick operation is done first, and the subsequent date filter is also quick because it is applied to only a few rows.

As mentioned by @jangorecki, there is room for improvement in that matter.
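
In the meantime, one workaround is to let data.table resolve the cheap ID condition by binary search, either with a key or a secondary index; a minimal sketch, assuming sorting by ID (or adding an index) is acceptable for your data:

setkey(SimData, ID)                     # sorts the table once; enables binary search on ID
SimData[.(1)][DATE >= "2005-01-01"]     # binary-search subset on ID, then filter the few remaining rows

setindex(SimData, ID)                   # alternative: secondary index, leaves row order unchanged
SimData[ID == 1][DATE >= "2005-01-01"]  # the ID == 1 subset can use the index

Either way the expensive date comparison is only applied to the handful of rows matching ID == 1.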

answered by Waldi

