Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Checking whether sets of columns are the same, row wise in R, in any order

Tags:

r

dplyr

rowwise

I am working in R, and would prefer a dplyr solution if possible.

sample data:

data.frame(
  col1 = c("a", "b", "c", "d"),
  col2 = c("a", "b", "d", "a"),
  col3 = rep("a", 4L),
  col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"),
  col6 = rep(c("b", "a"), each = 2L)
)
col1 col2 col3 col4 col5 col6
a a a a a b
b b a b a b
c d a d c a
d a a a d a

Question

I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.

So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.

Desired result

Have put a note on "assessment" column to aid understanding

col1 col2 col3 col4 col5 col6 assessment
a a a a a b FALSE (because 1-3 are not same as 4-6)
b b a b a b TRUE (because 1-3 are the same as 4-6, if ignore order)
c d a d c a TRUE (because 1-3 are the same as 4-6, if ignore order)
d a a a d a TRUE (because 1-3 are the same as 4-6, if ignore order)
like image 249
fe108 Avatar asked Dec 09 '25 12:12

fe108


1 Answers

Base R:

df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6])))

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

Reproducible data:

df <- data.frame(
  col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"),
  col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a")
)

PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:

df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE)))
bench::mark(
  apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))),
  apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600])))
)
#   expression                                                   min median `itr/sec` mem_alloc 
#   <bch:expr>                                                 <bch> <bch:>     <dbl> <bch:byt>   
# 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s  1.68s     0.594     333MB   
# 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s  9.01s     0.111     191MB 

PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a further speedup.

like image 96
sindri_baldur Avatar answered Dec 12 '25 07:12

sindri_baldur



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!