I would like to create a sibship network using data.table.
My data looks like this
indata <-
structure(list(id = c(1L, 2L, 3L, 4L, 12L, 13L, 14L, 15L), fid = c(NA,
9L, 1L, 1L, 7L, 5L, 5L, 5L), mid = c(0L, NA, 2L, 2L, 6L, 6L,
6L, 8L)), .Names = c("id", "fid", "mid"), class = "data.frame", row.names =
c(NA, -8L))
which is
id fid mid
1 1 NA 0
2 2 9 NA
3 3 1 2
4 4 1 2
5 12 7 6
6 13 5 6
7 14 5 6
8 15 5 8
The three columns represent id, id of the father and id of the mother, respectively. 0 or NA means not available. Thus, in the data above person 3 and 4 are full siblings (they both have father 1 and mother 2), while 12 and 13 are half siblings (they have different fathers but the same mum, 6).
For each row in the data frame I would like a list of the person's siblings (let us just consider half siblings to start with). My ideal end result would be
id fid mid sibs
1 1 NA 0 NA
2 2 9 NA NA
3 3 1 2 4
4 4 1 2 3
5 12 7 6 13, 14
6 13 5 6 12, 14, 15
7 14 5 6 12, 13, 15
8 15 5 8 13, 14
where the last column, sibs, is a list or vector (and it does not have to be part of the dataset).
A crude version to get the output using base R is given below
# get a list of offspring for each father id
foffspring <- by(indata, indata$fid, function(i) { i$id }, simplify=FALSE)
# and mother id
moffspring <- by(indata, indata$mid, function(i) { i$id }, simplify=FALSE)
To get the siblings run through each id. Find their father and mother and combine the two relevant entries from the previous lists
sibs <- sapply( 1:nrow(indata), function(i) {
res <- c()
if( !is.na(indata$fid[i]) )
res <- c(res, unlist(foffspring[paste0(indata$fid[i])]))
if( !is.na(indata$mid[i]) )
res <- c(res, unlist(moffspring[paste0(indata$mid[i])]))
unique(res[res != indata$id[i]])
}, simplify=TRUE )
This produces
> sibs
[[1]]
integer(0)
[[2]]
integer(0)
[[3]]
[1] 4
[[4]]
[1] 3
[[5]]
[1] 13 14
[[6]]
[1] 14 15 12
[[7]]
[1] 13 15 12
[[8]]
[1] 13 14
which was the desired output. Now the code above is not fast or pretty and I'd really like to see if I could get a fancy data.table version. However, my data.table-fu appears to be lacking.
library(data.table)
DT <- data.table(indata)
# Create lists with the _indices_ of the offsprings
FT <- DT[ , list( yidx = list(.I) ) , by = fid ]
MT <- DT[ , list( yidx = list(.I) ) , by = mid ]
MT looks like this
mid yidx
1: NA 2
2: 0 1
3: 2 3,4
4: 6 5,6,7
5: 8 8
Exactly like moffspring above except it contains the indices and not the labels. That, however, is not really a problem. Then I'd like to merge the tables together
setkey(DT, fid)
setkey(FT, fid)
setkey(MT, mid)
# Inner join
P1 <- DT[FT]
# And inner join on mother
setkey(P1, mid)
P1[MT]
and now the end result looks like this
id fid mid yidx i.yidx
1: 2 9 NA 2 2
2: 1 NA 0 1 1
3: 3 1 2 3,4 3,4
4: 4 1 2 3,4 3,4
5: 13 5 6 6,7,8 5,6,7
6: 14 5 6 6,7,8 5,6,7
7: 12 7 6 5 5,6,7
8: 15 5 8 6,7,8 8
This is almost there. Now if I take the row-wise union of yidx and i.yidx then I get the list of half-sibs (including the person self), and a row-wise intersection would yield the full siblings. Note that the indices here refer to the index in DT and not in the final data.table but that can be fixed as well.
However, ... I have a nagging feeling that something like this could be done much more efficiently in a few lines of data.table code and a "gentle wave of a hand". Can anyone point me in the right direction?
[Sorry for the super long post]
Update based on the answers below. Just for the fun of it I ran the three different suggestions through microbenchmark to see if there would be any timing differences among the three approaches. f1() is the suggestion by @Frank, f2() is the solution given by @mtoto, and f3 is @amatsuo_net's approach. Tried on vectors of length 1000, and here are the output
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 4020.8112 4387.7950 4614.7896 4498.8043 4770.1184 6837.672 100 c
f2() 656.9575 685.7706 727.5191 710.3003 735.2832 1080.423 100 a
f3() 1637.8927 1706.7528 1789.1794 1739.4428 1814.7776 2403.474 100 b
Quite a substantial difference in approaches. I need to run it through a dataset with 7 million id's so that certainly has a noticeable impact. Thanks all!
I would hold back on list columns as long as possible.
Starting with siblings, here's a simple approach:
sibDT = DT[!is.na(fid) & !is.na(mid),
CJ(id = id, sid = id)[id != sid]
, by=.(fid, mid)]
# fid mid id sid
# 1: 1 2 3 4
# 2: 1 2 4 3
# 3: 5 6 13 14
# 4: 5 6 14 13
And then define half siblings as sharing a parent but not appearing in sibDT:
hsibDT = melt(DT, id = "id")[!is.na(value),
CJ(id = id, hsid = id)[id != hsid]
, by=.(ptype = variable, pid = value)][!sibDT, on=.(id, hsid = sid)]
# ptype pid id hsid
# 1: fid 5 13 15
# 2: fid 5 14 15
# 3: fid 5 15 13
# 4: fid 5 15 14
# 5: mid 6 12 13
# 6: mid 6 12 14
# 7: mid 6 13 12
# 8: mid 6 14 12
I would stop here, but to browse the results with a list or character column...
DT[sibDT[, .(sibs = toString(sid)), by=id], on=.(id), sibs := i.sibs, by=.EACHI ]
DT[hsibDT[, .(hsibs = toString(hsid)), by=id], on=.(id), hsibs := i.hsibs, by=.EACHI ]
# or...
DT[
rbind(sibDT[, .(id, oid = sid)], hsibDT[, .(id, oid = hsid)])[,
.(asibs = toString(oid))
, by=.(id)],
on = .(id),
asibs := i.asibs
, by = .EACHI]
which gives
id fid mid sibs hsibs asibs
1: 1 NA 0 NA NA NA
2: 2 9 NA NA NA NA
3: 3 1 2 4 NA 4
4: 4 1 2 3 NA 3
5: 12 7 6 NA 13, 14 13, 14
6: 13 5 6 14 15, 12 14, 15, 12
7: 14 5 6 13 15, 12 13, 15, 12
8: 15 5 8 NA 13, 14 13, 14
Adding these columns to DT is counterproductive unless your analysis is complete. I guess any useful analysis would be on the non-list columns contained in the various tables.
Here's an approach using mapply() in combination with setdiff() and union(). After collecting the id's into a list, first we exclude the current id, and then union() the lists from both sides:
setDT(indata)[,msib:=.(list(id)), by = "mid"][
,msibs := mapply(setdiff, msib, id)][
,fsib := .(list(id)), by = "fid"][
,fsibs := mapply(setdiff, fsib, id)][
,sibs := mapply(union, msibs, fsibs)][
,c("msib","msibs", "fsib", "fsibs") := NULL]
> indata
# id fid mid sibs
#1: 1 NA 0
#2: 2 9 NA
#3: 3 1 2 4
#4: 4 1 2 3
#5: 12 7 6 13,14
#6: 13 5 6 12,14,15
#7: 14 5 6 12,13,15
#8: 15 5 8 13,14
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With