Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

remove rows with overlaped dates and keep longest time interval in R using dplyr or sqldf

Tags:

r

dplyr

lubridate

I need to remove rows with overlapped dates and keep the x value which is maximum among the overlapped dates. Here is a data frame:

data.frame(time_left = c("2011-08-05",
"2011-07-25",
"2017-08-20",
"2017-08-20", 
"2017-10-09", 
"2019-06-01"), 
time_right= c("2011-09-14",
"2011-09-01",
"2017-09-12",
"2017-09-26",
"2017-10-15",
"2019-11-05"),
x = c(114,20,10,1,5,100) ) -> df

so my input is:

   time_left time_right   x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01  20
3 2017-08-20 2017-09-12  10
4 2017-08-20 2017-09-26   1
5 2017-10-09 2017-10-15   5
6 2019-06-01 2019-11-05 100

and my desired output is:

  time_left   time_right   x
1 2011-08-05 2011-09-14  114
2 2011-07-25 2011-09-01   20
4 2017-08-20 2017-09-26   10  
5 2017-10-09 2017-10-15    5
6 2019-06-01 2019-11-05  100

I appreciate any help.

like image 431
ADEN Avatar asked Oct 19 '25 15:10

ADEN


1 Answers

@Maël brought this issue to my attention over on the ivs issue page https://github.com/DavisVaughan/ivs/issues/20.

I think this can be very elegantly and efficiently solved with ivs, but it is a bit hard to come up with the solution, so I'll probably add a helper to do this more easily.

This solution works with "recursive" containers too, i.e. where range A contains range B, but then range C also contains range A, so you really only want to list range C. I've described this in more detail with examples here https://github.com/DavisVaughan/ivs/issues/20#issuecomment-1234479783.

library(ivs)
library(dplyr)
library(vctrs)

df <- tibble(
  time_left = as.Date(c(
    "2011-08-05", "2011-07-25", "2017-08-20",
    "2017-08-20", "2017-10-09", "2019-06-01"
  )),
  time_right = as.Date(c(
    "2011-09-14", "2011-09-01", "2017-09-12",
    "2017-09-26", "2017-10-15", "2019-11-05"
  )),
  x = c(114, 20, 10, 1, 5, 100)
)

df <- df %>% 
  mutate(range = iv(time_left, time_right), .keep = "unused")

df
#> # A tibble: 6 × 2
#>       x                    range
#>   <dbl>               <iv<date>>
#> 1   114 [2011-08-05, 2011-09-14)
#> 2    20 [2011-07-25, 2011-09-01)
#> 3    10 [2017-08-20, 2017-09-12)
#> 4     1 [2017-08-20, 2017-09-26)
#> 5     5 [2017-10-09, 2017-10-15)
#> 6   100 [2019-06-01, 2019-11-05)

iv_locate_max_containment <- function(x) {
  # Find all locations where the range "contains" any other range
  # (including itself)
  locs <- iv_locate_overlaps(x, x, type = "contains")
  
  # Find the "top" ranges, i.e. the containers that aren't contained
  # by any other containers
  top <- !vec_duplicate_detect(locs$haystack)
  top <- vec_slice(locs$haystack, top)
  top <- vec_in(locs$needles, top)
  
  locs <- vec_slice(locs, top)
  
  locs
}

# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#>   needles haystack
#> 1       1        1
#> 2       2        2
#> 3       4        3
#> 4       4        4
#> 5       5        5
#> 6       6        6

iv_align(df$range, df$x, locations = locs) %>%
  rename(range = needles) %>%
  group_by(range) %>%
  summarise(x = max(haystack))
#> # A tibble: 5 × 2
#>                      range     x
#>                 <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01)    20
#> 2 [2011-08-05, 2011-09-14)   114
#> 3 [2017-08-20, 2017-09-26)    10
#> 4 [2017-10-09, 2017-10-15)     5
#> 5 [2019-06-01, 2019-11-05)   100

Created on 2022-09-01 with reprex v2.0.2

like image 192
Davis Vaughan Avatar answered Oct 21 '25 05:10

Davis Vaughan



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!