Sibling network using data.table

Question

I would like to create a sibship network using data.table.

My data looks like this

indata <-
structure(list(id = c(1L, 2L, 3L, 4L, 12L, 13L, 14L, 15L), fid = c(NA, 
9L, 1L, 1L, 7L, 5L, 5L, 5L), mid = c(0L, NA, 2L, 2L, 6L, 6L, 
6L, 8L)), .Names = c("id", "fid", "mid"), class = "data.frame", row.names = 
c(NA, -8L))

which is

  id fid mid
1  1  NA   0
2  2   9  NA
3  3   1   2
4  4   1   2
5 12   7   6
6 13   5   6
7 14   5   6
8 15   5   8

The three columns represent id, id of the father and id of the mother, respectively. 0 or NA means not available. Thus, in the data above person 3 and 4 are full siblings (they both have father 1 and mother 2), while 12 and 13 are half siblings (they have different fathers but the same mum, 6).

For each row in the data frame I would like a list of the person's siblings (let us just consider half siblings to start with). My ideal end result would be

  id fid mid sibs
1  1  NA   0 NA
2  2   9  NA NA
3  3   1   2 4
4  4   1   2 3
5 12   7   6 13, 14
6 13   5   6 12, 14, 15
7 14   5   6 12, 13, 15
8 15   5   8 13, 14

where the last column, sibs, is a list or vector (and it does not have to be part of the dataset).

A crude version to get the output using base R is given below

# get a list of offspring for each father id
foffspring <- by(indata, indata$fid, function(i) { i$id }, simplify=FALSE) 
# and mother id
moffspring <- by(indata, indata$mid, function(i) { i$id }, simplify=FALSE)

To get the siblings run through each id. Find their father and mother and combine the two relevant entries from the previous lists

sibs <- sapply( 1:nrow(indata), function(i) {
    res <- c()
    if( !is.na(indata$fid[i]) ) 
        res <- c(res, unlist(foffspring[paste0(indata$fid[i])]))
    if( !is.na(indata$mid[i]) ) 
        res <- c(res, unlist(moffspring[paste0(indata$mid[i])]))
    unique(res[res != indata$id[i]]) 
    }, simplify=TRUE )

This produces

> sibs
[[1]]
integer(0)

[[2]]
integer(0)

[[3]]
[1] 4

[[4]]
[1] 3

[[5]]
[1] 13 14

[[6]]
[1] 14 15 12

[[7]]
[1] 13 15 12

[[8]]
[1] 13 14

which was the desired output. Now the code above is not fast or pretty and I'd really like to see if I could get a fancy data.table version. However, my data.table-fu appears to be lacking.

library(data.table)
DT <- data.table(indata)
# Create lists with the _indices_ of the offsprings
FT <- DT[ , list( yidx = list(.I) ) , by = fid ]
MT <- DT[ , list( yidx = list(.I) ) , by = mid ]

MT looks like this

   mid  yidx
1:  NA     2
2:   0     1
3:   2   3,4
4:   6 5,6,7
5:   8     8

Exactly like moffspring above except it contains the indices and not the labels. That, however, is not really a problem. Then I'd like to merge the tables together

setkey(DT, fid)
setkey(FT, fid)
setkey(MT, mid)

# Inner join
P1 <- DT[FT]

# And inner join on mother
setkey(P1, mid)
P1[MT]

and now the end result looks like this

   id fid mid  yidx i.yidx
1:  2   9  NA     2      2
2:  1  NA   0     1      1
3:  3   1   2   3,4    3,4
4:  4   1   2   3,4    3,4
5: 13   5   6 6,7,8  5,6,7
6: 14   5   6 6,7,8  5,6,7
7: 12   7   6     5  5,6,7
8: 15   5   8 6,7,8      8

This is almost there. Now if I take the row-wise union of yidx and i.yidx then I get the list of half-sibs (including the person self), and a row-wise intersection would yield the full siblings. Note that the indices here refer to the index in DT and not in the final data.table but that can be fixed as well.

However, ... I have a nagging feeling that something like this could be done much more efficiently in a few lines of data.table code and a "gentle wave of a hand". Can anyone point me in the right direction?

[Sorry for the super long post]

Update based on the answers below. Just for the fun of it I ran the three different suggestions through microbenchmark to see if there would be any timing differences among the three approaches. f1() is the suggestion by @Frank, f2() is the solution given by @mtoto, and f3 is @amatsuo_net's approach. Tried on vectors of length 1000, and here are the output

Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval cld
 f1() 4020.8112 4387.7950 4614.7896 4498.8043 4770.1184 6837.672   100   c
 f2()  656.9575  685.7706  727.5191  710.3003  735.2832 1080.423   100 a  
 f3() 1637.8927 1706.7528 1789.1794 1739.4428 1814.7776 2403.474   100  b

Quite a substantial difference in approaches. I need to run it through a dataset with 7 million id's so that certainly has a noticeable impact. Thanks all!

Frank · Accepted Answer

I would hold back on list columns as long as possible.

Starting with siblings, here's a simple approach:

sibDT = DT[!is.na(fid) & !is.na(mid), 
  CJ(id = id, sid = id)[id != sid]
, by=.(fid, mid)]

#    fid mid id sid
# 1:   1   2  3   4
# 2:   1   2  4   3
# 3:   5   6 13  14
# 4:   5   6 14  13

And then define half siblings as sharing a parent but not appearing in sibDT:

hsibDT = melt(DT, id = "id")[!is.na(value), 
  CJ(id = id, hsid = id)[id != hsid]
, by=.(ptype = variable, pid = value)][!sibDT, on=.(id, hsid = sid)]

#    ptype pid id hsid
# 1:   fid   5 13   15
# 2:   fid   5 14   15
# 3:   fid   5 15   13
# 4:   fid   5 15   14
# 5:   mid   6 12   13
# 6:   mid   6 12   14
# 7:   mid   6 13   12
# 8:   mid   6 14   12

I would stop here, but to browse the results with a list or character column...

DT[sibDT[, .(sibs = toString(sid)), by=id], on=.(id), sibs := i.sibs, by=.EACHI ]
DT[hsibDT[, .(hsibs = toString(hsid)), by=id], on=.(id), hsibs := i.hsibs, by=.EACHI ]

# or...

DT[
  rbind(sibDT[, .(id, oid = sid)], hsibDT[, .(id, oid = hsid)])[, 
    .(asibs = toString(oid))
  , by=.(id)], 
  on = .(id),
  asibs := i.asibs
, by = .EACHI]

which gives

   id fid mid sibs  hsibs      asibs
1:  1  NA   0   NA     NA         NA
2:  2   9  NA   NA     NA         NA
3:  3   1   2    4     NA          4
4:  4   1   2    3     NA          3
5: 12   7   6   NA 13, 14     13, 14
6: 13   5   6   14 15, 12 14, 15, 12
7: 14   5   6   13 15, 12 13, 15, 12
8: 15   5   8   NA 13, 14     13, 14

Adding these columns to DT is counterproductive unless your analysis is complete. I guess any useful analysis would be on the non-list columns contained in the various tables.

mtoto · Answer

Here's an approach using mapply() in combination with setdiff() and union(). After collecting the id's into a list, first we exclude the current id, and then union() the lists from both sides:

setDT(indata)[,msib:=.(list(id)), by = "mid"][
  ,msibs := mapply(setdiff, msib, id)][
  ,fsib  := .(list(id)), by = "fid"][
  ,fsibs := mapply(setdiff, fsib, id)][
  ,sibs  := mapply(union, msibs, fsibs)][
  ,c("msib","msibs", "fsib", "fsibs") := NULL]
> indata
#   id fid mid     sibs
#1:  1  NA   0         
#2:  2   9  NA         
#3:  3   1   2        4
#4:  4   1   2        3
#5: 12   7   6    13,14
#6: 13   5   6 12,14,15
#7: 14   5   6 12,13,15
#8: 15   5   8    13,14

Sibling network using data.table

Tags:

r

data.table

ekstroem

2 Answers

Frank

mtoto

Recent Activity

Donate For Us

Sibling network using data.table

Tags:

r

data.table

ekstroem

2 Answers

Frank

mtoto

Related questions

Recent Activity

Donate For Us