I have a data frame of 205,000+ rows formatted as follows:
df <- data.frame(project.id = c('SP001', 'SP001', 'SP001', 'SP017', 'SP018', 'SP017'),
supplier.id = c('1224', '5542', '7741', '1224', '2020', '9122'))
In the actual data frame there are 6700+ unique values of project.id. I would like to create an edge list that pairs suppliers who have worked on the same project.
Desired end result for project.id = SP001:
to from
1224 5542
1224 7741
5542 7741
So far I've tried using split to create a list by project.id and then running lapply+combn to generate all possible combinations of supplier.id within each list/group:
try.list <- split(df, df$project.id)
try.output <- lapply(try.list, function(x) combn(x$supplier.id, 2))
Is there a more elegant/efficient (read "computed in less than 2hrs") way to generate something like this?
Any help would be much appreciated
Instead of using split and lapply, you can use the dplyr package.
df <- data.frame(project.id = c('SP001', 'SP001', 'SP001', 'SP017', 'SP018', 'SP017'),
supplier.id = c('1224', '5542', '7741', '1224', '2020', '9122'),
stringsAsFactors = FALSE)
library(dplyr)
df %>% group_by(project.id) %>%
filter(n()>=2) %>% group_by(project.id) %>%
do(data.frame(t(combn(.$supplier.id, 2)), stringsAsFactors=FALSE))
# Source: local data frame [4 x 3]
# Groups: project.id [2]
# project.id X1 X2
# (chr) (chr) (chr)
# 1 SP001 1224 5542
# 2 SP001 1224 7741
# 3 SP001 5542 7741
# 4 SP017 1224 9122
You can just merge it with itself which gets you all the Cartesian pairs:
temp <- merge(df,df, by="project.id")
res <- temp[ temp$supplier.id.x != temp$supplier.id.y , ]
> res
project.id supplier.id.x supplier.id.y
2 SP001 1224 5542
3 SP001 1224 7741
4 SP001 5542 1224
6 SP001 5542 7741
7 SP001 7741 1224
8 SP001 7741 5542
11 SP017 1224 9122
12 SP017 9122 1224
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With