I want to know how best to check for duplicates but in a way that ignores NA values--i.e., if a value is compared to an NA, it is considered a match.
For example, if I have a dataset like such
dt <- data.table(col1 = c("A", "B", NA, "A", "C", "C"),
col2 = c("Z", "Y", "Z", NA, "X", "X"),
col3 = c(1, 2, 1, 1, 3, 3))
dt
col1 col2 col3
1: A Z 1
2: B Y 2
3: <NA> Z 1
4: A <NA> 1
5: C X 3
6: C X 3
I want row 5 and 6 to be flagged as a duplicate pair, but also rows 1 and 4. All the approaches I know only flag row 5 and 6.
e.g.,
dupes_flagged <- dt %>%
group_by(pick(1:3)) %>%
mutate(
dupe_group = cur_group_id()
)
dupes_flagged
col1 col2 col3 dupe_group
1: A Z 1 1
2: B Y 2 3
3: <NA> Z 1 5
4: A <NA> 1 2
5: C X 3 4
6: C X 3 4
dt <- dt %>% as.data.table()
dup = duplicated(dt, by = c("col1", "col2", "col3"))
dt[, dupe_group := dup | c(tail(dup, -1), FALSE)]
dt
col1 col2 col3 dupe_group
1: A Z 1 FALSE
2: B Y 2 FALSE
3: <NA> Z 1 FALSE
4: A <NA> 1 FALSE
5: C X 3 TRUE
6: C X 3 TRUE
Introducing NA as being a wildcard, you'd need to compare all pairs of rows to identify duplicates: you can use combn to apply a function to all pairs of rows. To compare row-elements, you can use ==, and check if all elements are equal, discarding NAs (na.rm = FALSE).
cb <- combn(seq(nrow(dt)), m = 2, FUN = \(x){
all(dt[x[1], ] == dt[x[2], ], na.rm = TRUE)
})
mat <- matrix(nrow = nrow(dt), ncol = nrow(dt))
mat[lower.tri(mat)] <- cb
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] NA NA NA NA NA NA
# [2,] FALSE NA NA NA NA NA
# [3,] TRUE FALSE NA NA NA NA
# [4,] TRUE FALSE TRUE NA NA NA
# [5,] FALSE FALSE FALSE FALSE NA NA
# [6,] FALSE FALSE FALSE FALSE TRUE NA
Here, we see for instance that row 1 is duplicate with row 3 and 4, but not row 2.
What you eventually need is to know whether a row is duplicate with any other row:
colSums(mat, na.rm = TRUE) > 0 | rowSums(mat, na.rm = TRUE) > 0
#[1] TRUE FALSE TRUE TRUE TRUE TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With