The dplyr help file for join_by() contains the code below, which uses the helpers within() and overlaps(). How should I understand these two helpers? Thanks!
library(dplyr)
segments <- tibble(
segment_id = 1:4,
chromosome = c("chr1", "chr2", "chr2", "chr1"),
start = c(140, 210, 380, 230),
end = c(150, 240, 415, 280)
)
reference <- tibble(
reference_id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr2"),
start = c(100, 200, 300, 415),
end = c(150, 250, 399, 450)
)
sample 1: within
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
sample 2: overlaps
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
within
only captures rows where the range in x lies entirely inside the range in y.
overlaps
captures rows where the ranges of x and y overlap in any way. This includes rows where x lies entirely within y, so every within match is also an overlaps match.
It might be easier to understand like this (note this uses the default bounds = "[]" of overlaps):
Example:
x_lower = c(1, 10, 5, 10)
x_upper = c(4, 25, 6, 15)
y_lower = c(0, 15, 10, 3)
y_upper = c(10, 16, 20, 30)
df <- data.frame(x_lower, x_upper, y_lower, y_upper)
transform(df,
  is_within = x_lower >= y_lower & x_upper <= y_upper,
  is_overlap = x_lower <= y_upper & x_upper >= y_lower)
#   x_lower x_upper y_lower y_upper is_within is_overlap
# 1       1       4       0      10      TRUE       TRUE
# 2      10      25      15      16     FALSE       TRUE
# 3       5       6      10      20     FALSE      FALSE
# 4      10      15       3      30      TRUE       TRUE
From the documentation:
within(x_lower, x_upper, y_lower, y_upper)
For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
And
overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
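As a rough check of the two documented equivalences, here is a minimal base R sketch. The helper names is_within and is_overlap and the four ranges are made up for illustration (they are not part of dplyr); the ranges cover x inside y, x containing y, x touching y at a single point, and x disjoint from y.

```r
# Hypothetical helpers mirroring the documented inequalities (default "[]" bounds).
is_within <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower >= y_lower & x_upper <= y_upper
}
is_overlap <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower <= y_upper & x_upper >= y_lower
}

# Made-up ranges: x inside y, x containing y, x touching y, x disjoint from y.
x_lower <- c(2, 0, 5, 1)
x_upper <- c(4, 9, 8, 2)
y_lower <- c(1, 3, 8, 5)
y_upper <- c(6, 5, 9, 7)

w <- is_within(x_lower, x_upper, y_lower, y_upper)
o <- is_overlap(x_lower, x_upper, y_lower, y_upper)
data.frame(x_lower, x_upper, y_lower, y_upper, is_within = w, is_overlap = o)

# Every within match is also an overlaps match, but not the other way round.
all(o[w])
```

Only the first range counts as within, while the first three all count as overlapping, which is why overlaps() generally returns at least as many matches as within().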
The documentation of join_by() covers both helper functions.
For within():
For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.
Joining on chromosome alone lists every candidate pair; the comments mark whether each pair satisfies within() and, if not, why:
library(dplyr)
full_join(segments, reference, by = "chromosome")
# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # both x smaller than y
3          2 chr2           210   240            3     300   399 # both x smaller than y
4          2 chr2           210   240            4     415   450 # both x smaller than y
5          3 chr2           380   415            3     300   399 # x$end (415) outside range
6          3 chr2           380   415            4     415   450 # x$start (380) outside range
7          4 chr1           230   280            1     100   150 # both x greater than y
8          4 chr1           230   280            2     200   250 # x$end (280) outside range
Therefore, an inner_join() with join_by(within()) gives:
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
# A tibble: 1 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
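To see that within() is just shorthand for those two inequalities, here is a small sketch that applies them by hand and compares the result with the join. It assumes dplyr >= 1.1.0 (needed for join_by() and cross_join()); the object names manual and joined are only for illustration.

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Pair every segment with every reference, then apply the conditions by hand.
manual <- cross_join(segments, reference) %>%
  filter(
    chromosome.x == chromosome.y,
    start.x >= start.y,  # x_lower >= y_lower
    end.x <= end.y       # x_upper <= y_upper
  )

# The same match expressed with the within() helper.
joined <- inner_join(
  segments, reference,
  join_by(chromosome, within(x$start, x$end, y$start, y$end))
)

manual[c("segment_id", "reference_id")]
joined[c("segment_id", "reference_id")]
```

Both approaches should keep only the pair segment 1 / reference 1. The join_by() version avoids building the full cross product first, which matters on larger tables.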
For overlaps():
For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.
# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # x$end (150) smaller than y$start (200)
3          2 chr2           210   240            3     300   399 # x$end (240) smaller than y$start (300)
4          2 chr2           210   240            4     415   450 # x$end (240) smaller than y$start (415)
5          3 chr2           380   415            3     300   399 # yes
6          3 chr2           380   415            4     415   450 # yes
7          4 chr1           230   280            1     100   150 # x$start (230) > y$end (150)
8          4 chr1           230   280            2     200   250 # yes
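Therefore, a full_join() with join_by(overlaps()) gives the result below. Segment 2 overlaps no reference range, so the full join keeps it with NA in the reference columns:
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
# A tibble: 5 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
2          2 chr2           210   240           NA      NA    NA
3          3 chr2           380   415            3     300   399
4          3 chr2           380   415            4     415   450
5          4 chr1           230   280            2     200   250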
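The bounds argument quoted above only matters for ranges that touch at an endpoint, such as segment 3 (ending at 415) and reference 4 (starting at 415). A minimal sketch, assuming dplyr >= 1.1.0 and using the illustrative names closed, open, and dropped:

```r
library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)
reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Default bounds = "[]": touching endpoints count as an overlap (<= and >=).
closed <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
)

# bounds = "()": strict inequalities (< and >), so touching ranges do not match.
open <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "()"))
)

nrow(closed)
nrow(open)

# The pair that is matched only under the closed bounds.
dropped <- anti_join(closed, open, by = c("segment_id", "reference_id"))
dropped
```

With strict bounds the pair segment 3 / reference 4 should drop out, because 415 > 415 is false, while the other three overlapping pairs remain.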