I have a large dataframe that looks something like that:
| id | x |
|---|---|
| 1 | 123 |
| 2 | 123 |
| 2 | 125 |
| 3 | 125 |
| 4 | 200 |
tibble(id = c(1, 2, 2, 3, 4),
x = c(123, 123, 125, 125, 200))
I want to link my records and expand them. I.e. where they have the same x, they should have the same id. Note, record id 2 & 3 are considered same entity because of their shared link with 1. In other words, without record id 1, records 2 and 3 will not link. I'd also like to give it the smallest id value where the records link.
Results should look something like:
| id | x |
|---|---|
| 1 | 123 |
| 1 | 123 |
| 1 | 125 |
| 1 | 125 |
| 4 | 200 |
An equivalent way to restate this problem is "if the data frame represented the 'to' and 'from' nodes of a directed graph, what would be the first ID of each unique component of the graph?"
We can show this graphically:
library(igraph)
library(tidyverse)
df <- tibble(id = c(1, 2, 2, 3, 4),
x = c(123, 123, 125, 125, 200))
g <- graph_from_data_frame(df)
If we plot the graph, we will see that there are two unconnected clusters of nodes (called components); one for the grouped IDs 1:3, and one for ID 4:
plot(g)

We can use the function components to work out which group each ID belongs to, and get the first ID in each group:
comp <- components(g)$membership
df %>%
mutate(group = comp[match(id, names(comp))]) %>%
mutate(id = first(id), .by = 'group') %>%
select(-group)
#> # A tibble: 5 x 2
#> id x
#> <dbl> <dbl>
#> 1 1 123
#> 2 1 123
#> 3 1 125
#> 4 1 125
#> 5 4 200
Created on 2023-09-07 with reprex v2.0.2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With