
Rolling join with multiple matches

I have two data.tables, dt1 and dt2. I want to join the two tables by a unique ID and by date. The dates might or might not match; hence, I have to use the roll feature. However, dt2 can have multiple rows with the same ID and date and I want to match all of those rows.

A minimal example of the data structure:

library(data.table)
library(lubridate)   # for dmy()
dt1 <- data.table(ID = 1:2,
                  Date = c(dmy("31122021"), dmy("31122022")),
                  Value1 = c("A", "B"))

dt2 <- data.table(ID = c(rep(1, times=6), rep(2, 8)),
                  Company = c(1:3, 1:3, 1:4, 1:4),
                  Date = c(rep(dmy("31122021"), times=3),
                           rep(dmy("31032022"), times=3),
                           rep(dmy("31032023"), times=4),
                           rep(dmy("30062023"), times=4)),
                  Value2 = 1:14)

setkey(dt1, ID, Date)
setkey(dt2, ID, Date)

dt1
   ID       Date Value1
1:  1 2021-12-31      A
2:  2 2022-12-31      B

dt2
    ID Company       Date Value2
 1:  1       1 2021-12-31      1
 2:  1       2 2021-12-31      2
 3:  1       3 2021-12-31      3
 4:  1       1 2022-03-31      4
 5:  1       2 2022-03-31      5
 6:  1       3 2022-03-31      6
 7:  2       1 2023-03-31      7
 8:  2       2 2023-03-31      8
 9:  2       3 2023-03-31      9
10:  2       4 2023-03-31     10
11:  2       1 2023-06-30     11
12:  2       2 2023-06-30     12
13:  2       3 2023-06-30     13
14:  2       4 2023-06-30     14

If I do the following, then for ID = 2 I only get the single row with Value2 = 7:

dt2[dt1, roll = "nearest"]
   ID Company       Date Value2 Value1
1:  1       1 2021-12-31      1      A
2:  1       2 2021-12-31      2      A
3:  1       3 2021-12-31      3      A
4:  2       1 2022-12-31      7      B

However, the rows with Value2 = 7 to 10 all have the same ID and date in dt2, so I would like all of them to be returned.

The following code gets me the desired result, but it is rather cumbersome and feels like a hack:

dt1[, Date1 := Date]                              # keep a copy of dt1's date for the comparison below
dt <- dt1[dt2, roll = "nearest"]                  # roll every dt2 row to the nearest dt1 row
dt[, DiffDate := abs(as.numeric(Date1 - Date))]   # days between dt1's and dt2's dates
dt[, MinDiffDate := min(DiffDate), by = ID]
dt <- dt[DiffDate == MinDiffDate]                 # keep only the closest date(s) per ID
dt

   ID       Date Value1      Date1 Company Value2 DiffDate MinDiffDate
1:  1 2021-12-31      A 2021-12-31       1      1        0           0
2:  1 2021-12-31      A 2021-12-31       2      2        0           0
3:  1 2021-12-31      A 2021-12-31       3      3        0           0
4:  2 2023-03-31      B 2022-12-31       1      7       90          90
5:  2 2023-03-31      B 2022-12-31       2      8       90          90
6:  2 2023-03-31      B 2022-12-31       3      9       90          90
7:  2 2023-03-31      B 2022-12-31       4     10       90          90
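
For reference, the same filtering can be written a bit more compactly (a sketch of an equivalent variant of this workaround, without the helper columns):

dt1[, Date1 := Date]
dt <- dt1[dt2, roll = "nearest"]
# keep, per ID, only the rows whose rolled date is closest to Date1
dt[dt[, .I[abs(Date1 - Date) == min(abs(Date1 - Date))], by = ID]$V1]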

Other questions on SO cover similar issues, such as here and here. But the first link implies there is no solution, and the solution proposed in the second link does not generate the desired result:

#https://stackoverflow.com/questions/43540746/rolling-joins-on-multiple-matches
DT1 <- dt2[dt1, roll="nearest"]
DT2 <- dt1[dt2, roll = "nearest"]
dt_match <- unique(rbindlist(list(DT1, DT2), use.names=TRUE))
dt_match

    ID Company       Date Value2 Value1
 1:  1       1 2021-12-31      1      A
 2:  1       2 2021-12-31      2      A
 3:  1       3 2021-12-31      3      A
 4:  2       1 2022-12-31      7      B
 5:  1       1 2022-03-31      4      A
 6:  1       2 2022-03-31      5      A
 7:  1       3 2022-03-31      6      A
 8:  2       1 2023-03-31      7      B
 9:  2       2 2023-03-31      8      B
10:  2       3 2023-03-31      9      B
11:  2       4 2023-03-31     10      B
12:  2       1 2023-06-30     11      B
13:  2       2 2023-06-30     12      B
14:  2       3 2023-06-30     13      B
15:  2       4 2023-06-30     14      B

Is there a better solution than my workaround for performing a rolling join with several matches?

Update in response to r2evans' answer

dt1 includes many more columns that I want to keep; I have adjusted it here to include two more:

dt1 <- data.table(ID = 1:2,
                  Date = c(dmy("31122021"), dmy("31122022")),
                  Value1 = c("A", "B"),
                  Info1  = c("Info x", "Info y"),
                  Info2  = c("Info 1", "Info 2"))

How can I adjust the call dt2[, Date1:=Date][dt1, .(ID, Date=Date1, Value1), roll=-Inf] so that I don't have to type the columns manually? I tried something like the following, but it doesn't work; I guess a character vector with with=FALSE can only select columns, not rename them:

colnames <- c("Date=Date1", names(dt1)[names(dt1)!="Date"])
dt2[
  unique(dt2[, Date1 := Date][dt1, colnames, with=FALSE, roll = -Inf]),
  on=.(ID, Date)]
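
One possible way to build the column list programmatically (a sketch, assuming every dt1 column except Date should carry through and Date should be taken from Date1) is to construct the j list with mget():

cols <- setdiff(names(dt1), "Date")   # assumption: keep all dt1 columns except Date
dt2[, Date1 := Date]
rolled <- dt2[dt1, c(list(Date = Date1), mget(cols)), roll = -Inf]
dt2[unique(rolled), on = .(ID, Date)]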
Christoph_J asked Oct 17 '25

2 Answers

We can keep it to a single join, and avoid filtering a larger-than-necessary result, by "collapsing" Company and Value2, performing the rolling join, and then "expanding" Company and Value2 back out.

dt2[
  # "collapse" Company and Value2
  ,.(Company = .(Company), Value2 = .(Value2), Date2 = Date), .(ID, Date)
][
  dt1, on = .(ID, Date), roll = "nearest" # perform the rolling join
][
  # "expand" the collapsed columns back out
  ,lapply(.SD, \(x) if (is.list(x)) unlist(x) else rep.int(x, lengths(Company)))
]
#>    ID       Date Company Value2      Date2 Value1  Info1  Info2
#> 1:  1 2021-12-31       1      1 2021-12-31      A Info x Info 1
#> 2:  1 2021-12-31       2      2 2021-12-31      A Info x Info 1
#> 3:  1 2021-12-31       3      3 2021-12-31      A Info x Info 1
#> 4:  2 2022-12-31       1      7 2023-03-31      B Info y Info 2
#> 5:  2 2022-12-31       2      8 2023-03-31      B Info y Info 2
#> 6:  2 2022-12-31       3      9 2023-03-31      B Info y Info 2
#> 7:  2 2022-12-31       4     10 2023-03-31      B Info y Info 2
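
To make the "collapse" step concrete, here is the intermediate result in isolation (a sketch using the example data; Date2 keeps a copy of dt2's original date, because the rolling join overwrites Date with dt1's date):

# one row per (ID, Date); Company and Value2 become list-columns
collapsed <- dt2[, .(Company = .(Company), Value2 = .(Value2), Date2 = Date), .(ID, Date)]
collapsed[, .(ID, Date, nCompanies = lengths(Company))]   # 4 groups holding 3 or 4 companies each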

Data

library(lubridate)
library(data.table)

dt1 <- data.table(ID = 1:2,
                  Date = c(dmy("31122021"), dmy("31122022")),
                  Value1 = c("A", "B"),
                  Info1  = c("Info x", "Info y"),
                  Info2  = c("Info 1", "Info 2"))

dt2 <- data.table(ID = c(rep(1, times=6), rep(2, 8)),
                  Company = c(1:3, 1:3, 1:4, 1:4),
                  Date = c(rep(dmy("31122021"), times=3),
                           rep(dmy("31032022"), times=3),
                           rep(dmy("31032023"), times=4),
                           rep(dmy("30062023"), times=4)),
                  Value2 = 1:14)

Benchmarking

I should point out that a double join is more performant, if that is a concern.

Solutions as functions:

fCollapse <- function(dt1, dt2) {
  setkey(
    setcolorder(
      dt2[
        # "collapse" Company and Value2
        ,.(Company = .(Company), Value2 = .(Value2), Date2 = Date), .(ID, Date)
      ][
        dt1, on = .(ID, Date), roll = "nearest" # perform the rolling join
      ][
        # "expand" the collapsed columns back out
        ,lapply(.SD, \(x) if (is.list(x)) unlist(x) else rep.int(x, lengths(Company)))
      ], c(names(dt1), "Company", "Date2", "Value2")
    ), ID, Date2, Date
  )
}

The double join (modified from @r2evans' answer):

fDouble <- function(dt1, dt2) {
  setkey(dt1, ID, Date)
  setkey(dt2, ID, Date)
  setcolorder(
    setnames(
      dt2[
        unique(
          dt2[, Date1 := Date][dt1, roll = "nearest"][
            , c("Company", "Value2") := NULL
          ]
        ),
        on = .(ID, Date = Date1)
      ][,Date1 := NULL],
      c("i.Date", "Date"), c("Date", "Date2")
    ), c(names(dt1), "Company", "Date2", "Value2")
  )
}

Create a larger benchmarking dataset.

dt1 <- unique(
  data.table(
    ID = sample(100, 1e4, 1),
    Date = as.Date(sample(18000:19000, 1e4, 1))
  )
)[, `:=`(
  Value1 = sample(LETTERS, .N, 1),
  Info1 = paste("Info", sample(letters, .N, 1)),
  Info2 = paste("Info", sample(100, .N, 1))
)]

dt2 <- data.table(
  ID = sample(100, 1e4, 1),
  Company = sample(100, 1e4, 1),
  Date = as.Date(sample(18000:19000, 1e4, 1)),
  Value2 = 1:1e4
)

Check that the answers are the same.

identical(
  setorder(fDouble(copy(dt1), copy(dt2))),
  setorder(fCollapse(copy(dt1), copy(dt2)))
)
#> [1] TRUE

Benchmark:

microbenchmark::microbenchmark(
  fCollapse = fCollapse(DT1, DT2),
  fDouble = fDouble(DT1, DT2),
  setup = {DT1 <- copy(dt1); DT2 <- copy(dt2)}
)
#> Unit: milliseconds
#>       expr     min       lq      mean   median       uq     max neval
#>  fCollapse 34.1741 37.45145 40.526797 39.28565 41.27445 94.1319   100
#>    fDouble  7.3764  7.67590  8.094951  7.85010  8.10155 12.4141   100
jblood94 answered Oct 18 '25


Not much better than your hack, but a different approach:

dt2[
  unique(dt2[, Date1 := Date][dt1, .(ID, Date=Date1, Value1), roll = -Inf]),
  on=.(ID, Date)]
#       ID Company       Date Value2      Date1 Value1
#    <int>   <int>     <Date>  <int>     <Date> <char>
# 1:     1       1 2021-12-31      1 2021-12-31      A
# 2:     1       2 2021-12-31      2 2021-12-31      A
# 3:     1       3 2021-12-31      3 2021-12-31      A
# 4:     2       1 2023-03-31      7 2023-03-31      B
# 5:     2       2 2023-03-31      8 2023-03-31      B
# 6:     2       3 2023-03-31      9 2023-03-31      B
# 7:     2       4 2023-03-31     10 2023-03-31      B

It still has the (slightly) unenviable need for Date1: without it, our intermediate step produces

dt2[dt1, .(ID, Date, Value1), roll=-Inf]
#       ID       Date Value1
#    <int>     <Date> <char>
# 1:     1 2021-12-31      A
# 2:     1 2021-12-31      A
# 3:     1 2021-12-31      A
# 4:     2 2022-12-31      B

which does not let us join back on dt2 with its original Date. Instead, we do this:

dt2[, Date1:=Date][dt1, .(ID, Date=Date1, Value1), roll=-Inf]
#       ID       Date Value1
#    <int>     <Date> <char>
# 1:     1 2021-12-31      A
# 2:     1 2021-12-31      A
# 3:     1 2021-12-31      A
# 4:     2 2023-03-31      B

in order to preserve the original Date so that we can join it (after unique(.)) back onto dt2 and bring all associated rows back in.
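
As an aside: roll = -Inf matches only dt2 dates on or after dt1's Date; if the nearest dt2 date could also fall earlier, roll = "nearest" (as in the question) should drop into the inner join as well, a sketch:

dt2[
  unique(dt2[, Date1 := Date][dt1, .(ID, Date = Date1, Value1), roll = "nearest"]),
  on = .(ID, Date)]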

r2evans answered Oct 18 '25