Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to check for duplicates while ignoring NAs (NA to value considered match)

I want to know how best to check for duplicates but in a way that ignores NA values--i.e., if a value is compared to an NA, it is considered a match.

For example, if I have a dataset like such

dt <- data.table(col1 = c("A", "B", NA, "A", "C", "C"), 
                 col2 = c("Z", "Y", "Z", NA, "X", "X"), 
                 col3 = c(1, 2, 1, 1, 3, 3))

dt

   col1 col2 col3
1:    A    Z    1
2:    B    Y    2
3: <NA>    Z    1
4:    A <NA>    1
5:    C    X    3
6:    C    X    3

I want row 5 and 6 to be flagged as a duplicate pair, but also rows 1 and 4. All the approaches I know only flag row 5 and 6.

e.g.,

dupes_flagged <- dt %>%
  group_by(pick(1:3)) %>%
  mutate(
    dupe_group = cur_group_id()
  )

dupes_flagged
   col1 col2 col3 dupe_group
1:    A    Z    1          1
2:    B    Y    2          3
3: <NA>    Z    1          5
4:    A <NA>    1          2
5:    C    X    3          4
6:    C    X    3          4
dt <- dt %>% as.data.table()
dup = duplicated(dt, by = c("col1", "col2", "col3"))
dt[, dupe_group := dup | c(tail(dup, -1), FALSE)]

dt
   col1 col2 col3 dupe_group
1:    A    Z    1      FALSE
2:    B    Y    2      FALSE
3: <NA>    Z    1      FALSE
4:    A <NA>    1      FALSE
5:    C    X    3       TRUE
6:    C    X    3       TRUE
like image 835
Sean Ryan Avatar asked Oct 22 '25 10:10

Sean Ryan


1 Answers

Introducing NA as being a wildcard, you'd need to compare all pairs of rows to identify duplicates: you can use combn to apply a function to all pairs of rows. To compare row-elements, you can use ==, and check if all elements are equal, discarding NAs (na.rm = FALSE).

cb <- combn(seq(nrow(dt)), m = 2, FUN = \(x){
  all(dt[x[1], ] == dt[x[2], ], na.rm = TRUE)
})
mat <- matrix(nrow = nrow(dt), ncol = nrow(dt))
mat[lower.tri(mat)] <- cb

#       [,1]  [,2]  [,3]  [,4] [,5] [,6]
# [1,]    NA    NA    NA    NA   NA   NA
# [2,] FALSE    NA    NA    NA   NA   NA
# [3,]  TRUE FALSE    NA    NA   NA   NA
# [4,]  TRUE FALSE  TRUE    NA   NA   NA
# [5,] FALSE FALSE FALSE FALSE   NA   NA
# [6,] FALSE FALSE FALSE FALSE TRUE   NA

Here, we see for instance that row 1 is duplicate with row 3 and 4, but not row 2.

What you eventually need is to know whether a row is duplicate with any other row:

colSums(mat, na.rm = TRUE) > 0 | rowSums(mat, na.rm = TRUE) > 0
#[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
like image 103
Maël Avatar answered Oct 24 '25 02:10

Maël



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!