
Usage of the 'within' and 'overlaps' helpers in join_by

Tags:

join

r

dplyr

The dplyr help file for join_by includes the code below, which uses within and overlaps. How should these two helpers be understood? Thanks!

library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)


reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

Example 1: within

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

Example 2: overlaps

by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
anderwyang asked Sep 06 '25 03:09


2 Answers

within only matches rows where the range in x lies entirely inside the range in y.

overlaps matches rows where the range in x overlaps the range in y in any way. This includes ranges that are entirely within, so every within match is also an overlaps match; the reverse does not hold.

The example below uses the default bounds of overlaps, "[]", which make both endpoints inclusive.

Example:

x_lower = c(1, 10, 5, 10)
x_upper = c(4, 25, 6, 15)

y_lower = c(0, 15, 10, 3)
y_upper = c(10, 16, 20, 30)

df <- data.frame(x_lower, x_upper, y_lower, y_upper)
transform(df, 
          is_within = x_lower >= y_lower & x_upper <= y_upper,
          is_overlap = x_lower <= y_upper & x_upper >= y_lower)

#   x_lower x_upper y_lower y_upper is_within is_overlap
# 1       1       4       0      10      TRUE       TRUE
# 2      10      25      15      16     FALSE       TRUE
# 3       5       6      10      20     FALSE      FALSE
# 4      10      15       3      30      TRUE       TRUE

From the documentation:

within(x_lower, x_upper, y_lower, y_upper)

For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.

And

overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")

For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
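The two documented inequalities can be compared directly. The sketch below is base R only; `is_within()` and `is_overlap()` are hypothetical helper names written from the quoted definitions, not dplyr functions. It assumes closed bounds and valid ranges (lower <= upper), and checks every small integer range pair to see how the two conditions relate.

```r
# Predicates written directly from the documented inequalities (closed bounds).
# is_within() and is_overlap() are hypothetical names, not part of dplyr.
is_within <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower >= y_lower & x_upper <= y_upper
}
is_overlap <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower <= y_upper & x_upper >= y_lower
}

# Enumerate every pair of valid ranges (lower <= upper) with endpoints in 1..6.
grid <- expand.grid(x_lower = 1:6, x_upper = 1:6, y_lower = 1:6, y_upper = 1:6)
grid <- grid[grid$x_lower <= grid$x_upper & grid$y_lower <= grid$y_upper, ]

w <- is_within(grid$x_lower, grid$x_upper, grid$y_lower, grid$y_upper)
o <- is_overlap(grid$x_lower, grid$x_upper, grid$y_lower, grid$y_upper)

all(o[w])    # TRUE: every within pair also satisfies the overlaps condition
any(o & !w)  # TRUE: some overlapping pairs are not within
```

Under these assumptions, within behaves as a stricter special case of overlaps.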

Maël answered Sep 08 '25 00:09


The documentation of join_by actually covers these two helper functions.

For within:

For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.

The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.

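To see which pairs qualify, first join on chromosome alone and check each candidate pair against this condition:

library(dplyr)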

full_join(segments, reference, by = "chromosome")
# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # both x smaller than y
3          2 chr2           210   240            3     300   399 # both x smaller than y
4          2 chr2           210   240            4     415   450 # both x smaller than y
5          3 chr2           380   415            3     300   399 # x$end (415) outside range
6          3 chr2           380   415            4     415   450 # x$start (380) outside range
7          4 chr1           230   280            1     100   150 # both x greater than y
8          4 chr1           230   280            2     200   250 # x$end (280) outside range

Therefore, join_by(within()) gives:

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

# A tibble: 1 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
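Since the documentation describes within() as equivalent to two inequalities, the same join can be written out by hand. This is a minimal sketch assuming dplyr >= 1.1.0 (where join_by() is available) and the segments and reference tibbles from the question; `by_helper` and `by_manual` are illustrative names.

```r
library(dplyr)  # join_by() and its helpers need dplyr >= 1.1.0

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Helper form, as in the question.
by_helper <- join_by(chromosome, within(x$start, x$end, y$start, y$end))

# Hand-written form, using the inequalities quoted from the documentation.
by_manual <- join_by(chromosome, x$start >= y$start, x$end <= y$end)

res_helper <- inner_join(segments, reference, by_helper)
res_manual <- inner_join(segments, reference, by_manual)

# Compare the two results; they are expected to match.
all.equal(res_helper, res_manual)
```

The helper form is shorter and states the intent, while the hand-written form makes it explicit which columns are compared with which.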

For overlaps:

For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.

bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.

Checking the same candidate pairs (from the full_join on chromosome above) against this condition:

# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # x$end (150) smaller than y$start (200)
3          2 chr2           210   240            3     300   399 # x$end (240) smaller than y$start (300)
4          2 chr2           210   240            4     415   450 # x$end (240) smaller than y$start (415)
5          3 chr2           380   415            3     300   399 # yes 
6          3 chr2           380   415            4     415   450 # yes
7          4 chr1           230   280            1     100   150 # x$start (230) > y$end (150)
8          4 chr1           230   280            2     200   250 # yes

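Therefore, full_join() with join_by(overlaps()) gives the result below. Segment 2 overlaps no reference range, so its reference columns are NA:

by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)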

# A tibble: 5 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
2          2 chr2           210   240           NA      NA    NA
3          3 chr2           380   415            3     300   399
4          3 chr2           380   415            4     415   450
5          4 chr1           230   280            2     200   250
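The bounds argument matters when two ranges only touch at an endpoint, as segment 3 (ending at 415) and reference 4 (starting at 415) do here. The sketch below, again assuming dplyr >= 1.1.0 and the question's data, compares the default "[]" with "()"; `closed`, `open` and `dropped` are illustrative names.

```r
library(dplyr)  # join_by() and its helpers need dplyr >= 1.1.0

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Default bounds "[]": endpoints are inclusive (<= and >=).
closed <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
)

# Bounds "()": strict inequalities (< and >), so touching endpoints do not count.
open <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "()"))
)

# Pairs matched with inclusive bounds but not with strict bounds.
dropped <- anti_join(closed, open, by = c("segment_id", "reference_id"))
dropped
```

With this data, the only pair that should disappear under "()" is segment 3 with reference 4, because 415 >= 415 holds but 415 > 415 does not.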
benson23 answered Sep 08 '25 02:09