Linking records with record expansion in R [duplicate]

Question

I have a large dataframe that looks something like that:

id	x
1	123
2	123
2	125
3	125
4	200

tibble(id = c(1, 2, 2, 3, 4),
       x = c(123, 123, 125, 125, 200))

I want to link my records and expand them. I.e. where they have the same x, they should have the same id. Note, record id 2 & 3 are considered same entity because of their shared link with 1. In other words, without record id 1, records 2 and 3 will not link. I'd also like to give it the smallest id value where the records link.

Results should look something like:

id	x
1	123
1	123
1	125
1	125
4	200

Allan Cameron · Accepted Answer

An equivalent way to restate this problem is "if the data frame represented the 'to' and 'from' nodes of a directed graph, what would be the first ID of each unique component of the graph?"

We can show this graphically:

library(igraph)
library(tidyverse)

df <- tibble(id = c(1, 2, 2, 3, 4),
             x = c(123, 123, 125, 125, 200))

g <- graph_from_data_frame(df)

If we plot the graph, we will see that there are two unconnected clusters of nodes (called components); one for the grouped IDs 1:3, and one for ID 4:

plot(g)

We can use the function components to work out which group each ID belongs to, and get the first ID in each group:

comp <- components(g)$membership

df %>%
  mutate(group = comp[match(id, names(comp))]) %>%
  mutate(id = first(id), .by = 'group') %>%
  select(-group)
#> # A tibble: 5 x 2
#>      id     x
#>   <dbl> <dbl>
#> 1     1   123
#> 2     1   123
#> 3     1   125
#> 4     1   125
#> 5     4   200

^{Created on 2023-09-07 with reprex v2.0.2}

Linking records with record expansion in R [duplicate]

Tags:

r

dplyr

tidyverse

record-linkage

sparklink

1 Answers

Allan Cameron

Recent Activity

Donate For Us

Linking records with record expansion in R [duplicate]

Tags:

r

dplyr

tidyverse

record-linkage

sparklink

1 Answers

Allan Cameron

Related questions

Recent Activity

Donate For Us