Can someone please explain what happens in this interactively run R code? I'm using dplyr 1.1.2 and data.table 1.14.8 on R 3.6.3 on Ubuntu Linux. Note that only after the line implicitly assigning an index to dt1 does this discrepancy occur. Why does dt[cond] give a different result to subset(dt, cond) in this case? And is it just generally "dangerous" to combine data.table and tidyverse functions in a chain?
library(data.table)
library(dplyr)
mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018] # this appears to assigns `year` as an index to mydt1
rbindlist(list(mydt1, mydt2))[year == 2017]
# produces expected output:
# year month
# 1: 2017 1
# 2: 2017 2
# 3: 2017 3
# 4: 2017 4
# 5: 2017 5
# 6: 2017 6
subset(bind_rows(mydt1, mydt2), year == 2017)
# produces the same output as above
bind_rows(mydt1, mydt2)[year == 2017]
# what happens now?
# year month
# 1: 2017 1
# 2: 2017 2
# 3: 2017 3
What you're seeing is "auto-indexing". With data.table, there can be only one physical index (i.e., setkey), but there can also be secondary indices. As a feature (for faster subetting), whenever a subset is done using strict == or %in%, a secondary index is created and stored with the object.
With data.table objects, when combining tables, the indices are dropped, as one might expect:
indices(mydt1)
# [1] "year" "year__month"
indices(mydt2)
# NULL
rbindlist(list(mydt1, mydt2)) |>
indices()
# NULL
rbindlist(list(mydt2, mydt1)) |>
indices()
# NULL
But with dplyr_1.1.2 (at least, not sure if/when things changed), it is preserving attributes of the first table.
bind_rows(mydt1, mydt2) |>
indices()
# [1] "year" "year__month"
bind_rows(mydt2, mydt1) |>
indices()
# NULL
This latter discovery suggests that we can get the desired results by reversing the tables:
bind_rows(mydt1, mydt2)[year == 2017]
# year month
# <int> <int>
# 1: 2017 1
# 2: 2017 2
# 3: 2017 3
bind_rows(mydt2, mydt1)[year == 2017]
# year month
# <int> <int>
# 1: 2017 4
# 2: 2017 5
# 3: 2017 6
# 4: 2017 1
# 5: 2017 2
# 6: 2017 3
From the documentation (same link above), we can set the "datatable.use.index" option to prevent this behavior. In a new R session (since I didn't want to spoil things):
options(datatable.use.index = FALSE)
mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018]
# year month
# <int> <int>
# 1: 2018 1
# 2: 2018 2
# 3: 2018 3
indices(mydt1)
# NULL
bind_rows(mydt1, mydt2)[year == 2017]
# year month
# <int> <int>
# 1: 2017 1
# 2: 2017 2
# 3: 2017 3
# 4: 2017 4
# 5: 2017 5
# 6: 2017 6
So my guess is that this is not a "bug" in either dplyr or data.table, but it sure would be nice if dplyr "knew" about this secondary index and chose to not pass this one through when it row-binds them together.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With