I have a genetic dataset where I am looking to order samples/genes, grouping by those which are in a certain distance to each other in the genome.
So for example my dataset looks like:
#dt1
Gene chromosome position CP
Gene1 1 70000200 1:70000200
Gene2 5 10000476 5:10000476
Gene3 1 70000201 1:70000201
Gene4 5 10000475 5:10000475
I also have an origin position dataset:
#dt2
chromosome position CP
1 70005000 1:70005000
5 10005000 5:10005000
I am trying to group genes in my 1st dataset if they are within +/- 500000 distance of any position in my second dt2 dataset and are on the same chromosome. I have an issue in my actual data where this might be true for a gene against multiple origin dt2 positions, so I'm also trying to sort to the one it is closest to.
Output aims to give ordered groups:
Gene chromosome position Group
Gene1 1 70000200 1
Gene3 1 70000201 1
Gene4 5 10000475 2
Gene2 5 10000476 2
Gene1 and Gene3 are within the 500000 of an origin dt2 position and all are on the same chromosome so grouped together and the same for Genes 4 and 2
Currently I am trying to do this with:
dt2[, c("low", "high") := .(position - 500000, position + 500000)]
#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.CP,
on = .(chromosome, position > low, position < high ) ]
#outputs:
Gene chromosome position CP match
1 Gene1 1 70000200 1:70000200 1:70005000
2 Gene2 5 10000476 5:10000476 5:10005000
3 Gene3 1 70000201 1:70000201 1:70005000
4 Gene4 5 10000475 5:10000475 5:10005000
I am having problems with this on 2 levels with seemingly not getting this output for the match column on my actual data, so I am wondering if there are other ways to code this that I can try. I am also struggling to convert the match column into grouping matches and identifying the groups as I want in my expected output, I have a biology background so I'm unsure how to change this - any help would be appreciated.
Input data:
#dt1:
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L,
70000201L, 10000475L), CP = c("1:70000200", "5:10000476",
"1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000",
"1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
#dt2:
structure(list(chromosome = c(1L, 5L), position = c(70005000L,
10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000,
9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"))
There are packages geared for doing this in Bioconductor. So one thing you can use is distanceToNearest() from GenomicRanges . First we convert them to GRanges objects:
library(GenomicRanges)
gr1=makeGRangesFromDataFrame(dt1,seqnames.field="chromosome",start.field="position",end.field="position")
values(gr1) = dt1[,c("Gene","CP")]
Give a group for dt2:
dt2$Group = 1:nrow(dt2)
gr2=makeGRangesFromDataFrame(dt2,seqnames.field="chromosome",start.field="position",end.field="position")
This step will match every row in gr1 (dt1 GRanges) to its nearest Range in gr2 (dt2):
matches = distanceToNearest(gr1,gr2)
Hits object with 4 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 1 | 4799
[2] 2 2 | 4523
[3] 3 1 | 4798
[4] 4 2 | 4524
-------
queryLength: 4 / subjectLength: 2
We assign this result back:
dt1$group = NA
dt1$group[queryHits(matches)] = dt2$Group[subjectHits(matches)]
dt1$distance = NA
dt1$distance[queryHits(matches)] = values(matches)$distance[subjectHits(matches)]
dt1
Gene chromosome position CP match group distance
1: Gene1 1 70000200 1:70000200 1:70005000 1 4799
2: Gene2 5 10000476 5:10000476 5:10005000 2 4523
3: Gene3 1 70000201 1:70000201 1:70005000 1 4799
4: Gene4 5 10000475 5:10000475 5:10005000 2 4523
Now you can filter away those that are > 500000
If I understand correctly, the OP wants
dt1 if they are within +/- 500000 distance of any position in dt2 and are on the same chromosome.dt2 positions, the closest one is to be selected.So, a rolling join to nearest with subsequent filtering might be the answer.
library(data.table)
dt2[, Group := .I][
dt1, on = .(chromosome, position), roll = "nearest",
.(Gene, chromosome, position, CP = i.CP,
Group = fifelse(abs(x.position - i.position) <= 500000L, Group, NA_integer_))][
order(Group, CP)]
Gene chromosome position CP Group 1: Gene1 1 70000200 1:70000200 1 2: Gene3 1 70000201 1:70000201 1 3: Gene12 1 70005199 1:70005199 1 4: Gene13 1 70005900 1:70005900 2 5: Gene4 5 10000475 5:10000475 3 6: Gene2 5 10000476 5:10000476 3 7: Gene11 1 80000200 1:80000200 NA
Please, note that expanded datasets are used here in order to check if the approach is working for different use cases.
There is one gene, Gene11, in the expanded dataset which has been assigned to no group, i.e., Group == NA, because it has no matching position in dt2 within the distance threshold of +/- 500000.
The approach is similar to StupidWolf's GenomicRanges answer but uses only data.table functionality.
Both datasets have been expanded in order to cover different uses cases:
library(data.table)
dt1 <- fread("Gene chromosome position CP
Gene1 1 70000200 1:70000200
Gene2 5 10000476 5:10000476
Gene3 1 70000201 1:70000201
Gene4 5 10000475 5:10000475
Gene11 1 80000200 1:80000200
Gene12 1 70005199 1:70005199
Gene13 1 70005900 1:70005900")
dt2 <- fread("chromosome position CP
1 70005000 1:70005000
1 70006000 1:70006000
5 10005000 5:10005000")
I think you are looking for the idiomatic way of performing a non-equi join and then update by reference, i.e.
dt2[, c("rn", "low", "high") := .(.I, position - 500000L, position + 500000L)]
#note that you perform the non-equi join first, and then
#extract the result column before `:=`, which updates by reference
dt1[, Group := dt2[.SD, on=.(chromosome, low<position, high>position), rn]]
dt1
edit regarding multiple matches. In this case you will require a left join:
dt2[, group := .I][dt1, on=.(chromosome, low<position, high>position)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With