I am working in R, and would prefer a dplyr solution if possible.
sample data:
data.frame(
col1 = c("a", "b", "c", "d"),
col2 = c("a", "b", "d", "a"),
col3 = rep("a", 4L),
col4 = c("a", "b", "d", "a"),
col5 = c("a", "a", "c", "d"),
col6 = rep(c("b", "a"), each = 2L)
)
| col1 | col2 | col3 | col4 | col5 | col6 |
|---|---|---|---|---|---|
| a | a | a | a | a | b |
| b | b | a | b | a | b |
| c | d | a | d | c | a |
| d | a | a | a | d | a |
Question
I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.
So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.
Desired result
Have put a note on "assessment" column to aid understanding
| col1 | col2 | col3 | col4 | col5 | col6 | assessment |
|---|---|---|---|---|---|---|
| a | a | a | a | a | b | FALSE (because 1-3 are not same as 4-6) |
| b | b | a | b | a | b | TRUE (because 1-3 are the same as 4-6, if ignore order) |
| c | d | a | d | c | a | TRUE (because 1-3 are the same as 4-6, if ignore order) |
| d | a | a | a | d | a | TRUE (because 1-3 are the same as 4-6, if ignore order) |
Base R:
df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6])))
# col1 col2 col3 col4 col5 col6 assessment
# 1 a a a a a b FALSE
# 2 b b a b a b TRUE
# 3 c d a d c a TRUE
# 4 d a a a d a TRUE
Reproducible data:
df <- data.frame(
col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"),
col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"),
col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a")
)
PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:
df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE)))
bench::mark(
apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))),
apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600])))
)
# expression min median `itr/sec` mem_alloc
# <bch:expr> <bch> <bch:> <dbl> <bch:byt>
# 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s 1.68s 0.594 333MB
# 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s 9.01s 0.111 191MB
PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a further speedup.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With