I was under the impression that data.table is extremely well optimized, so I was quite surprised to see this:
library(data.table)
SimData <- data.table(ID = sample(1:4e5, 4e6, replace = TRUE),
                      DATE = sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
                                    4e6, replace = TRUE))
microbenchmark::microbenchmark(SimData[ID==1&DATE>="2005-01-01"])
microbenchmark::microbenchmark(SimData[ID==1][DATE>="2005-01-01"])
The two solutions are quite obviously equivalent, yet there is more than an order of magnitude difference in runtime. Is it possible that data.table performs so poorly with the first form? (I.e., that it can't automatically optimize this call.) Or am I overlooking something here?
The slow operation is SimData[DATE >= "2005-01-01"], because it returns millions of rows:
microbenchmark::microbenchmark(SimData[DATE>="2005-01-01"],SimData[ID==1])
Unit: microseconds
                          expr     min       lq     mean   median       uq      max neval
 SimData[DATE >= "2005-01-01"] 32542.8 44549.55 51323.53 47529.75 50258.10 117396.3   100
              SimData[ID == 1]   820.0  1043.55  1397.79  1435.15  1688.25   2302.5   100
SimData[ID == 1] is much faster, as it returns only a few rows.
When you execute SimData[ID==1 & DATE>="2005-01-01"], you force both conditions to be evaluated on all rows before any row is dropped.
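Conceptually, the single-bracket form behaves like the sketch below (this illustrates the cost, it is not data.table's actual internals):
# Each comparison produces a logical vector over all 4e6 rows, and the
# two vectors are combined with & before any filtering happens.
i1 <- SimData$ID == 1                        # 4e6 comparisons
i2 <- SimData$DATE >= as.Date("2005-01-01")  # another 4e6 comparisons
res <- SimData[i1 & i2]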
With SimData[ID==1][DATE>="2005-01-01"], the quick operation is done first, and the subsequent filter is also quick because it is applied to only a few rows.
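You can confirm that the order of the chained filters is what matters by benchmarking both orderings (exact timings will vary by machine):
# Putting the unselective DATE filter first keeps millions of rows
# alive for the second subset, so it is far slower.
microbenchmark::microbenchmark(
  selective_first   = SimData[ID == 1][DATE >= "2005-01-01"],
  unselective_first = SimData[DATE >= "2005-01-01"][ID == 1]
)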
As mentioned by @jangorecki, there is room for improvement in this area.
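In the meantime, one workaround (an optional optimization, not something benchmarked above) is to key or index the table on ID, so that the selective lookup uses binary search instead of a full vector scan:
# setkey physically sorts the table by ID and enables binary search;
# .(1) performs a keyed join returning the rows where ID == 1.
setkey(SimData, ID)
SimData[.(1)][DATE >= as.Date("2005-01-01")]

# A secondary index achieves a similar speedup without reordering rows.
setindex(SimData, ID)
SimData[ID == 1][DATE >= as.Date("2005-01-01")]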