Checking whether sets of columns are the same, row wise in R, in any order

Question

I am working in R, and would prefer a dplyr solution if possible.

sample data:

data.frame(
  col1 = c("a", "b", "c", "d"),
  col2 = c("a", "b", "d", "a"),
  col3 = rep("a", 4L),
  col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"),
  col6 = rep(c("b", "a"), each = 2L)
)

col1	col2	col3	col4	col5	col6
a	a	a	a	a	b
b	b	a	b	a	b
c	d	a	d	c	a
d	a	a	a	d	a

Question

I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.

So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.

Desired result

Have put a note on "assessment" column to aid understanding

col1	col2	col3	col4	col5	col6	assessment
a	a	a	a	a	b	FALSE (because 1-3 are not same as 4-6)
b	b	a	b	a	b	TRUE (because 1-3 are the same as 4-6, if ignore order)
c	d	a	d	c	a	TRUE (because 1-3 are the same as 4-6, if ignore order)
d	a	a	a	d	a	TRUE (because 1-3 are the same as 4-6, if ignore order)

sindri_baldur · Accepted Answer

Base R:

df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6])))

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

Reproducible data:

df <- data.frame(
  col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"),
  col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a")
)

PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:

df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE)))
bench::mark(
  apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))),
  apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600])))
)
#   expression                                                   min median `itr/sec` mem_alloc 
#   <bch:expr>                                                 <bch> <bch:>     <dbl> <bch:byt>   
# 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s  1.68s     0.594     333MB   
# 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s  9.01s     0.111     191MB

PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a further speedup.

Checking whether sets of columns are the same, row wise in R, in any order

Tags:

r

dplyr

rowwise

fe108

1 Answers

sindri_baldur

Recent Activity

Donate For Us

Checking whether sets of columns are the same, row wise in R, in any order

Tags:

r

dplyr

rowwise

fe108

1 Answers

sindri_baldur

Related questions

Recent Activity

Donate For Us