Assume that I have several years of personal data, in which a Personal ID (indiv) identifies individuals across years, but the household ID (househ) only assigns household relation within a year.
Example:
libary(dplyr)
test_01 <- data.frame(indiv=c(1,2,3,4,5,6),househ=c(1,1,2,3,4,4),time=rep(1,6)) #1&2 form a household, 3 and 4 are single, 5&6 form a household
test_02 <- data.frame(indiv=c(2,3,4,5,6,7),househ=c(1,2,2,3,3,4),time=rep(2,6)) #1 exits, so 2 is now a new household, 3&4 now form a new household, 5&6 still do, 7 enters
test_03 <- data.frame(indiv=c(2,3,4,5,7,8,9,10),househ=c(1,2,2,3,4,5,5,6),time=rep(3,8)) #according to logic above
data_test_panel <- bind_rows(test_01,test_02,test_03)
The desired time-consistent household variable would be:
data_test_panel$true_household <- c(1,1,2,3,4,4,5,6,6,4,4,7,5,6,6,8,7,9,9,10)
So far I tried:
library(data.table)
setDT(data_test_panel)[,cons_household := .GRP,.(time,househ)] # where cons_household is the new household ID. However, this doesn't give the same household ID across time but assigns new values for every appearance of a household.
Thank you kindly for all your help! /Severin
Create a list column of unique individuals in a given household at each time period. We can then take the group number by that column.
dplyr approachThis is much nicer than the data.table approach:
library(dplyr)
data_test_panel |>
mutate(
indiv_in_household = list(indiv),
.by = c(househ, time)
) |>
mutate(
cons_household = cur_group_id(),
.by = indiv_in_household
)
Note this uses per-operation grouping with the .by parameter which requires at least dplyr v1.1.0.
data.table approachThe syntax for this is much more complicated because data.table will automatically unroll your list column into a vector if you try to make it list(indiv) by group.
data_test_panel[, indiv_in_household := .SD[
, .(split(
rep(indiv, length(indiv)),
rep(seq(length(indiv)), each = length(indiv))
)),
.(househ, time)
]$V1]
head(data_test_panel)
# indiv househ time true_household indiv_in_household
# <num> <num> <num> <num> <list>
# 1: 1 1 1 1 1,2
# 2: 2 1 1 1 1,2
# 3: 3 2 1 2 3
# 4: 4 3 1 3 4
# 5: 5 4 1 4 5,6
# 6: 6 4 1 4 5,6
Also data.table doesn't support grouping by list columns so we need to use toString(), which is OK in this case but in general is not ideal as you have to think through the format of the column to ensure it will not somehow cause ambiguity leading two distinct groups to be merged.
data_test_panel[, cons_household := .GRP, sapply(indiv_in_household, toString)]
data_test_panel[cons_household != true_household]
# Empty data.table (0 rows and 6 cols): indiv,househ,time,true_household,cons_household,indiv_in_household
data.table approach that is reasonably niceAs the previous answer shows, data.table's handling of list columns doesn't lend itself to the dplyr-style solution in which the list column of members per annual household is broadcast to all rows of the data, and then grouped-by to create the IDs.
However, we can use the following: (1) Collapse (instead of broadcasting) to a list column with a single row per household/year. (2) Create the ID as the index of each annual household in a list of unique households from all years. (3) Uncollapse the list column.
(data_test_panel
[, .(hh_indivs = list(indiv)), by=.(househ, time) # collapse
][, hh_id := match(hh_indivs, unique(hh_indivs)) # index
][, .(indiv = unlist(hh_indivs), hh_id), by=.(househ, time)]) # uncollapse
This avoids the complicated code to mimic the broadcasted list column, and the problematic use of toString() to enable a groupby.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With