Match records immediately before and after time range

Question

My objective is to join two data tables based on time with dplyr or data.table, specifically to get the record immediately before and immediately after an event.

In the example data, the events in this case are scooter trips. Below are four trips - two taken by scooter 1 and two by scooter 2.

> testScooter
                 start                 end id
1: 2018-01-18 22:19:13 2018-01-18 22:26:31  1
2: 2018-01-18 23:29:22 2018-01-18 23:37:53  1
3: 2018-01-18 00:22:02 2018-01-18 00:29:21  2
4: 2018-01-18 00:37:52 2018-01-18 01:06:53  2

In a separate table are records spaced at nearly equal intervals apart. The ids match and the scooter is marked no when a trip is underway.

> intervals
   id                time available charge
1   1 2018-01-18 21:31:07       yes     83
2   1 2018-01-18 21:41:07       yes     83
3   1 2018-01-18 21:51:07       yes     83
4   1 2018-01-18 22:01:07       yes     83
5   1 2018-01-18 22:11:07       yes     83
6   1 2018-01-18 22:21:07        no     83
7   1 2018-01-18 22:31:07       yes     81
8   1 2018-01-18 22:41:08       yes     81
9   1 2018-01-18 22:51:08       yes     81
10  1 2018-01-18 23:01:08       yes     81
11  1 2018-01-18 23:11:08       yes     81
12  1 2018-01-18 23:21:11       yes     81
13  1 2018-01-18 23:31:07        no     81
14  1 2018-01-18 23:41:09       yes     79
15  1 2018-01-18 23:51:07       yes     79
16  2 2018-01-18 00:01:06       yes     84
17  2 2018-01-18 00:11:06       yes     84
18  2 2018-01-18 00:21:06       yes     84
19  2 2018-01-18 00:31:05       yes     80
20  2 2018-01-18 00:41:06        no     80
21  2 2018-01-18 00:51:06        no     80
22  2 2018-01-18 01:01:06        no     80
23  2 2018-01-18 01:11:05       yes     80
24  2 2018-01-18 01:21:05       yes     80
25  2 2018-01-18 01:31:05       yes     80

The output I am trying to produce is the following.

> output
                 start                 end id startCharge endCharge
1: 2018-01-18 22:19:13 2018-01-18 22:26:31  1          83        81
2: 2018-01-18 23:29:22 2018-01-18 23:37:53  1          81        79
3: 2018-01-18 00:22:02 2018-01-18 00:29:21  2          84        80
4: 2018-01-18 00:37:52 2018-01-18 01:06:53  2          80        80

Any suggestions on how to match on nearest time before and after a time range would be helpful, maybe by using lubridate::new_interval() or roll='nearest' from the data.table package but I am not sure where to begin.

# Here is the sample data

library(data.table)

testScooter <- setDT(
structure(list(start = structure(c(1516313953, 1516318162, 1516234922, 
1516235872), tzone = "", class = c("POSIXct", "POSIXt")), end = structure(c(1516314391, 
1516318673, 1516235361, 1516237613), tzone = "", class = c("POSIXct", 
"POSIXt")), id = c(1, 1, 2, 2)), .Names = c("start", "end", "id"
), row.names = c(NA, -4L), class = "data.frame"))

intervals <- 
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    time = structure(c(1516311067, 1516311667, 1516312267, 1516312867, 
    1516313467, 1516314067, 1516314667, 1516315268, 1516315868, 
    1516316468, 1516317068, 1516317671, 1516318267, 1516318869, 
    1516319467, 1516233666, 1516234266, 1516234866, 1516235465, 
    1516236066, 1516236666, 1516237266, 1516237865, 1516238465, 
    1516239065), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    available = c("yes", "yes", "yes", "yes", "yes", "no", "yes", 
    "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", 
    "yes", "yes", "yes", "no", "no", "no", "yes", "yes", "yes"
    ), charge = c(83L, 83L, 83L, 83L, 83L, 83L, 81L, 81L, 81L, 
    81L, 81L, 81L, 81L, 79L, 79L, 84L, 84L, 84L, 80L, 80L, 80L, 
    80L, 80L, 80L, 80L)), .Names = c("id", "time", "available", 
"charge"), row.names = c(NA, -25L), class = "data.frame")

Jaap · Accepted Answer

New answer:

You can do this with a double rolling join:

testScooter[, startCharge := intervals[testScooter, on = .(id, time = start), roll = Inf, x.charge]
            ][, endCharge := intervals[testScooter, on = .(id, time = end), roll = -Inf, x.charge]][]

which gives the desired result:

                 start                 end id startCharge endCharge
1: 2018-01-18 23:19:13 2018-01-18 23:26:31  1          83        81
2: 2018-01-19 00:29:22 2018-01-19 00:37:53  1          81        79
3: 2018-01-18 01:22:02 2018-01-18 01:29:21  2          84        80
4: 2018-01-18 01:37:52 2018-01-18 02:06:53  2          80        80

What this does:

roll = Inf looks for the last observation in intervals before start
roll = -Inf looks for the first observation in intervals after end

See also the Note about why the new answer is better.

Old answer:

testScooter[intervals, on = .(id, start = time), roll = -Inf, startCharge := i.charge
            ][intervals, on = .(id, end = time), roll = Inf, endCharge := i.charge][]

Note:

As @Frank noted here on Github, data.table returns the last match in i when there are multiple matches, which is the case for the old answer. See the following output when the code is run with verbose = TRUE:

> testScooter[intervals, on = .(id, start = time), roll = -Inf, startCharge := i.charge, verbose = TRUE][]
Calculated ad hoc index in 0 secs
Starting bmerge ...done in 0 secs
Detected that j uses these columns: startCharge,i.charge 
Assigning to 16 row subset of 4 rows
                 start                 end id startCharge
1: 2018-01-18 22:19:13 2018-01-18 22:26:31  1          83
2: 2018-01-18 23:29:22 2018-01-18 23:37:53  1          81
3: 2018-01-18 00:22:02 2018-01-18 00:29:21  2          84
4: 2018-01-18 00:37:52 2018-01-18 01:06:53  2          80

Although this behavior doesn't lead to any problems in this example, it is less efficient and could possibly lead to unintended results. See this example (courtesy to @Frank):

> data.table(a = 1:2)[data.table(a = c(2L, 2L), v = 3:4), on=.(a), v := i.v, verbose = TRUE][]
Calculated ad hoc index in 0 secs
Starting bmerge ...done in 0 secs
Detected that j uses these columns: v,i.v 
Assigning to 2 row subset of 2 rows
   a  v
1: 1 NA
2: 2  4

The new answer is therefore the better option.

Used data:

testScooter <- structure(list(start = structure(c(1516313953, 1516318162, 1516234922, 1516235872), tzone = "UTC", class = c("POSIXct", "POSIXt")),
                              end = structure(c(1516314391, 1516318673, 1516235361, 1516237613), tzone = "UTC", class = c("POSIXct", "POSIXt")),
                              id = c(1L, 1L, 2L, 2L)),
                         .Names = c("start", "end", "id"), row.names = c(NA, -4L), class = "data.frame")
setDT(testScooter)

intervals <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
                            time = structure(c(1516311067, 1516311667, 1516312267, 1516312867, 1516313467, 1516314067, 1516314667, 1516315268, 1516315868, 1516316468, 1516317068, 1516317671, 1516318267, 1516318869, 1516319467, 1516233666, 1516234266, 1516234866, 1516235465, 1516236066, 1516236666, 1516237266, 1516237865, 1516238465, 1516239065), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                            available = c("yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "yes", "yes", "yes"), 
                            charge = c(83L, 83L, 83L, 83L, 83L, 83L, 81L, 81L, 81L, 81L, 81L, 81L, 81L, 79L, 79L, 84L, 84L, 84L, 80L, 80L, 80L, 80L, 80L, 80L, 80L)),
                       .Names = c("id", "time", "available", "charge"), row.names = c(NA, -25L), class = "data.frame")
setDT(intervals)

Match records immediately before and after time range

Tags:

r

data.table

dplyr

lubridate

the_darkside

1 Answers

Jaap

Recent Activity

Donate For Us

Match records immediately before and after time range

Tags:

r

data.table

dplyr

lubridate

the_darkside

1 Answers

Jaap

Related questions

Recent Activity

Donate For Us