I have two data.tables, `dt1` and `dt2`. I want to join the two tables by a unique ID and by date. The dates might or might not match; hence, I have to use the `roll` feature. However, `dt2` can have multiple rows with the same ID and date, and I want to match all of those rows.
A minimal example of the data structure:
library(data.table)
library(lubridate)
dt1 <- data.table(ID = 1:2,
                  Date = c(dmy("31122021"), dmy("31122022")),
                  Value1 = c("A", "B"))
dt2 <- data.table(ID = c(rep(1, times=6), rep(2, 8)),
                  Company = c(1:3, 1:3, 1:4, 1:4),
                  Date = c(rep(dmy("31122021"), times=3),
                           rep(dmy("31032022"), times=3),
                           rep(dmy("31032023"), times=4),
                           rep(dmy("30062023"), times=4)),
                  Value2 = 1:14)
setkey(dt1, ID, Date)
setkey(dt2, ID, Date)
dt1
   ID       Date Value1
1:  1 2021-12-31      A
2:  2 2022-12-31      B
dt2
    ID Company       Date Value2
 1:  1       1 2021-12-31      1
 2:  1       2 2021-12-31      2
 3:  1       3 2021-12-31      3
 4:  1       1 2022-03-31      4
 5:  1       2 2022-03-31      5
 6:  1       3 2022-03-31      6
 7:  2       1 2023-03-31      7
 8:  2       2 2023-03-31      8
 9:  2       3 2023-03-31      9
10:  2       4 2023-03-31     10
11:  2       1 2023-06-30     11
12:  2       2 2023-06-30     12
13:  2       3 2023-06-30     13
14:  2       4 2023-06-30     14
If I do the following, then for ID 2 I only get the single row that includes Value2=7:
dt2[dt1, roll = "nearest"]
   ID Company       Date Value2 Value1
1:  1       1 2021-12-31      1      A
2:  1       2 2021-12-31      2      A
3:  1       3 2021-12-31      3      A
4:  2       1 2022-12-31      7      B
However, Value2=7 to 10 all have the same ID and date, so I would like all of them to be returned.
The following code gets me the desired result, but it is rather cumbersome and feels like a hack:
dt1[, Date1 := Date]
dt <- dt1[dt2, roll = "nearest"]
dt[, DiffDate := abs(as.numeric(Date1 - Date))]
dt[, MinDiffDate := min(DiffDate), by = ID]
dt <- dt[DiffDate == MinDiffDate]
dt
   ID       Date Value1      Date1 Company Value2 DiffDate MinDiffDate
1:  1 2021-12-31      A 2021-12-31       1      1        0           0
2:  1 2021-12-31      A 2021-12-31       2      2        0           0
3:  1 2021-12-31      A 2021-12-31       3      3        0           0
4:  2 2023-03-31      B 2022-12-31       1      7       90          90
5:  2 2023-03-31      B 2022-12-31       2      8       90          90
6:  2 2023-03-31      B 2022-12-31       3      9       90          90
7:  2 2023-03-31      B 2022-12-31       4     10       90          90
Other questions on SO have dealt with similar issues, such as here and here. But the first link implies there is no solution, and the solution proposed in the second link does not generate the desired result:
#https://stackoverflow.com/questions/43540746/rolling-joins-on-multiple-matches
DT1 <- dt2[dt1, roll="nearest"]
DT2 <- dt1[dt2, roll = "nearest"]
dt_match <- unique(rbindlist(list(DT1, DT2), use.names=TRUE))
dt_match
    ID Company       Date Value2 Value1
 1:  1       1 2021-12-31      1      A
 2:  1       2 2021-12-31      2      A
 3:  1       3 2021-12-31      3      A
 4:  2       1 2022-12-31      7      B
 5:  1       1 2022-03-31      4      A
 6:  1       2 2022-03-31      5      A
 7:  1       3 2022-03-31      6      A
 8:  2       1 2023-03-31      7      B
 9:  2       2 2023-03-31      8      B
10:  2       3 2023-03-31      9      B
11:  2       4 2023-03-31     10      B
12:  2       1 2023-06-30     11      B
13:  2       2 2023-06-30     12      B
14:  2       3 2023-06-30     13      B
15:  2       4 2023-06-30     14      B
Is there a better solution than mine for performing a rolling join with multiple matches?
Update in response to r2evans' answer

`dt1` includes many more columns that I want to keep; I adjust it now to include two more:
dt1 <- data.table(ID = 1:2,
                  Date = c(dmy("31122021"), dmy("31122022")),
                  Value1 = c("A", "B"),
                  Info1 = c("Info x", "Info y"),
                  Info2 = c("Info 1", "Info 2"))
How can I adjust the call `dt2[, Date1:=Date][dt1, .(ID, Date=Date1, Value1), roll=-Inf]` in such a way that I don't have to type the columns manually? I tried something like the following, but it doesn't work; I guess `with=FALSE` can only select columns, not rename them:
colnames <- c("Date=Date1", names(dt1)[names(dt1)!="Date"])
dt2[
unique(dt2[, Date1 := Date][dt1, colnames, with=FALSE, roll = -Inf]),
on=.(ID, Date)]
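One way to avoid typing the columns out (a sketch, not necessarily the only way) is to build the `j` expression as text and use data.table's support for a top-level `eval()` of a quoted expression in `j`. The example below rebuilds the updated `dt1`/`dt2` with base `as.Date` to avoid the lubridate dependency; the names `keep`, `j`, and `res` are just for illustration.

```r
library(data.table)

# Rebuilt example data (base as.Date instead of lubridate::dmy)
dt1 <- data.table(ID = 1:2,
                  Date = as.Date(c("2021-12-31", "2022-12-31")),
                  Value1 = c("A", "B"),
                  Info1 = c("Info x", "Info y"),
                  Info2 = c("Info 1", "Info 2"))
dt2 <- data.table(ID = c(rep(1, 6), rep(2, 8)),
                  Company = c(1:3, 1:3, 1:4, 1:4),
                  Date = as.Date(rep(c("2021-12-31", "2022-03-31",
                                       "2023-03-31", "2023-06-30"),
                                     times = c(3, 3, 4, 4))),
                  Value2 = 1:14)
setkey(dt1, ID, Date)
setkey(dt2, ID, Date)

# Build the j expression programmatically: Date is renamed from Date1,
# every other dt1 column is passed through by name.
keep <- setdiff(names(dt1), "Date")
j <- parse(text = sprintf(".(Date = Date1, %s)",
                          paste(keep, collapse = ", ")))[[1]]

# Same two-step join as before, but with the generated j expression
res <- dt2[
  unique(dt2[, Date1 := Date][dt1, eval(j), roll = -Inf]),
  on = .(ID, Date)]
```

This keeps all of `dt1`'s columns without listing them by hand; adding further `Info` columns to `dt1` requires no change to the join code.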
We can keep it to a single join, and avoid filtering on a larger-than-necessary join, by "collapsing" `Company` and `Value2`, performing the rolling join, then "expanding" `Company` and `Value2` back out.
dt2[
  # "collapse" Company and Value2
  , .(Company = .(Company), Value2 = .(Value2), Date2 = Date), .(ID, Date)
][
  dt1, on = .(ID, Date), roll = "nearest" # perform the rolling join
][
  # "expand" the collapsed columns back out
  , lapply(.SD, \(x) if (is.list(x)) unlist(x) else rep.int(x, lengths(Company)))
]
#>    ID       Date Company Value2      Date2 Value1  Info1  Info2
#> 1:  1 2021-12-31       1      1 2021-12-31      A Info x Info 1
#> 2:  1 2021-12-31       2      2 2021-12-31      A Info x Info 1
#> 3:  1 2021-12-31       3      3 2021-12-31      A Info x Info 1
#> 4:  2 2022-12-31       1      7 2023-03-31      B Info y Info 2
#> 5:  2 2022-12-31       2      8 2023-03-31      B Info y Info 2
#> 6:  2 2022-12-31       3      9 2023-03-31      B Info y Info 2
#> 7:  2 2022-12-31       4     10 2023-03-31      B Info y Info 2
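The collapse/expand mechanics may be easier to see in isolation. A minimal sketch on a toy table (the names `d`, `collapsed`, and `expanded` are just for illustration):

```r
library(data.table)

d <- data.table(g = c(1, 1, 2), v = 1:3)

# "Collapse": one row per group; v becomes a list column holding
# each group's values as a vector.
collapsed <- d[, .(v = .(v)), by = g]

# "Expand": unlist the list column, restoring one row per value.
expanded <- collapsed[, .(v = unlist(v)), by = g]
```

Because the rolling join runs on the collapsed (one row per ID/Date) table, it never multiplies rows; the expansion afterwards restores the per-Company detail.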
Data
library(lubridate)
library(data.table)
dt1 <- data.table(ID = 1:2,
                  Date = c(dmy("31122021"), dmy("31122022")),
                  Value1 = c("A", "B"),
                  Info1 = c("Info x", "Info y"),
                  Info2 = c("Info 1", "Info 2"))
dt2 <- data.table(ID = c(rep(1, times=6), rep(2, 8)),
                  Company = c(1:3, 1:3, 1:4, 1:4),
                  Date = c(rep(dmy("31122021"), times=3),
                           rep(dmy("31032022"), times=3),
                           rep(dmy("31032023"), times=4),
                           rep(dmy("30062023"), times=4)),
                  Value2 = 1:14)
I should point out that a double join is more performant, if that is a concern.
Solutions as functions:
fCollapse <- function(dt1, dt2) {
  setkey(
    setcolorder(
      dt2[
        # "collapse" Company and Value2
        , .(Company = .(Company), Value2 = .(Value2), Date2 = Date), .(ID, Date)
      ][
        dt1, on = .(ID, Date), roll = "nearest" # perform the rolling join
      ][
        # "expand" the collapsed columns back out
        , lapply(.SD, \(x) if (is.list(x)) unlist(x) else rep.int(x, lengths(Company)))
      ], c(names(dt1), "Company", "Date2", "Value2")
    ), ID, Date2, Date
  )
}
The double join (modified from @r2evans' answer):
fDouble <- function(dt1, dt2) {
  setkey(dt1, ID, Date)
  setkey(dt2, ID, Date)
  setcolorder(
    setnames(
      dt2[
        unique(
          dt2[, Date1 := Date][dt1, roll = "nearest"][
            , c("Company", "Value2") := NULL
          ]
        ),
        on = .(ID, Date = Date1)
      ][, Date1 := NULL],
      c("i.Date", "Date"), c("Date", "Date2")
    ), c(names(dt1), "Company", "Date2", "Value2")
  )
}
Create a larger benchmarking dataset.
dt1 <- unique(
  data.table(
    ID = sample(100, 1e4, replace = TRUE),
    Date = as.Date(sample(18000:19000, 1e4, replace = TRUE))
  )
)[, `:=`(
  Value1 = sample(LETTERS, .N, replace = TRUE),
  Info1 = paste("Info", sample(letters, .N, replace = TRUE)),
  Info2 = paste("Info", sample(100, .N, replace = TRUE))
)]
dt2 <- data.table(
  ID = sample(100, 1e4, replace = TRUE),
  Company = sample(100, 1e4, replace = TRUE),
  Date = as.Date(sample(18000:19000, 1e4, replace = TRUE)),
  Value2 = 1:1e4
)
Check that the answers are the same.
identical(
  setorder(fDouble(copy(dt1), copy(dt2))),
  setorder(fCollapse(copy(dt1), copy(dt2)))
)
#> [1] TRUE
Benchmark:
microbenchmark::microbenchmark(
  fCollapse = fCollapse(DT1, DT2),
  fDouble = fDouble(DT1, DT2),
  setup = {DT1 <- copy(dt1); DT2 <- copy(dt2)}
)
#> Unit: milliseconds
#>      expr     min       lq      mean   median       uq     max neval
#> fCollapse 34.1741 37.45145 40.526797 39.28565 41.27445 94.1319   100
#>   fDouble  7.3764  7.67590  8.094951  7.85010  8.10155 12.4141   100
Not much better than your hack, but a different approach:
dt2[
unique(dt2[, Date1 := Date][dt1, .(ID, Date=Date1, Value1), roll = -Inf]),
on=.(ID, Date)]
#       ID Company       Date Value2      Date1 Value1
#    <int>   <int>     <Date>  <int>     <Date> <char>
# 1:     1       1 2021-12-31      1 2021-12-31      A
# 2:     1       2 2021-12-31      2 2021-12-31      A
# 3:     1       3 2021-12-31      3 2021-12-31      A
# 4:     2       1 2023-03-31      7 2023-03-31      B
# 5:     2       2 2023-03-31      8 2023-03-31      B
# 6:     2       3 2023-03-31      9 2023-03-31      B
# 7:     2       4 2023-03-31     10 2023-03-31      B
It still has the (slightly) unenviable need for `Date1`, which is necessary because without it, our intermediate step produces
dt2[dt1, .(ID, Date, Value1), roll=-Inf]
#       ID       Date Value1
#    <int>     <Date> <char>
# 1:     1 2021-12-31      A
# 2:     1 2021-12-31      A
# 3:     1 2021-12-31      A
# 4:     2 2022-12-31      B
which does not let us join back on `dt2` with its original `Date`. Instead, we do this:
dt2[, Date1:=Date][dt1, .(ID, Date=Date1, Value1), roll=-Inf]
#       ID       Date Value1
#    <int>     <Date> <char>
# 1:     1 2021-12-31      A
# 2:     1 2021-12-31      A
# 3:     1 2021-12-31      A
# 4:     2 2023-03-31      B
in order to preserve the original `Date`, so that we can join it (after `unique(.)`) back onto `dt2` to bring all associated rows back in.
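As an aside, the direction of the roll matters across these solutions (`"nearest"` in some, `-Inf` here). A minimal sketch of the three variants on a single lookup date, with toy tables `x` and `i` (names assumed for illustration):

```r
library(data.table)

x <- data.table(ID = 1,
                Date = as.Date(c("2022-01-01", "2022-12-31")),
                v = 1:2,
                key = c("ID", "Date"))
i <- data.table(ID = 1, Date = as.Date("2022-06-30"))

x[i, roll = TRUE]$v       # forward roll (LOCF): matches 2022-01-01, v = 1
x[i, roll = -Inf]$v       # backward roll (NOCB): matches 2022-12-31, v = 2
x[i, roll = "nearest"]$v  # closest date wins: 180 vs 184 days, so v = 1
```

`roll = -Inf` is appropriate above because the `dt1` dates precede (or equal) the candidate `dt2` dates; with `"nearest"` the choice would instead depend on which `dt2` date is closest.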