 

Count similarity of occurrences across columns R

Tags: r, dplyr

I have the following data:

df <- data.frame(
  group = c('r1','r2','r3','r4'),
  X1 = c('A','B','C','K'),
  X2 = c('A','C','M','K'),
  X3 = c('D','A','C','K')
)

> df
  group X1 X2 X3
1    r1  A  A  D
2    r2  B  C  A
3    r3  C  M  C
4    r4  K  K  K

I want to compute a 'similarity score' for each row based on columns X1, X2 and X3. For example, within group r1 (row 1), 2 out of 3 elements are identical, so the score is 2/3 (~67%). For group r4 (row 4), all three elements are identical, so the score is 3/3 (100%). The desired output is below:

> df
  group X1 X2 X3 similarity_score
1    r1  A  A  D             0.67
2    r2  B  C  A             0.33
3    r3  C  M  C             0.67
4    r4  K  K  K             1.00

How can I achieve this?

asked Dec 02 '25 04:12 by unaeem



2 Answers

A possible solution with dplyr, taking the share of the most frequent value in each row:

library(dplyr)

df %>% 
  rowwise() %>% 
  # share of the most frequent value among X1:X3 in the current row
  mutate(score = max(prop.table(table(c_across(X1:X3))))) %>% 
  ungroup()

#> # A tibble: 4 × 5
#>   group X1    X2    X3    score
#>   <chr> <chr> <chr> <chr> <dbl>
#> 1 r1    A     A     D     0.667
#> 2 r2    B     C     A     0.333
#> 3 r3    C     M     C     0.667
#> 4 r4    K     K     K     1

Or even shorter:

library(tidyverse)
df %>% mutate(score = pmap_dbl(across(X1:X3), ~ max(prop.table(table(c(...))))))
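
Both variants rely on the same expression, max(prop.table(table(...))), evaluated once per row. As a rough illustration of what that expression does, here is a minimal sketch for a single row (the values of r1 from the question). The names x, counts, shares and score are only for illustration and are not part of the answer's code.

```r
# Values of row r1 from the question's data
x <- c("A", "A", "D")

counts <- table(x)            # frequency of each distinct value: A = 2, D = 1
shares <- prop.table(counts)  # each count divided by the total count
score  <- max(shares)         # share of the most frequent value

print(round(score, 2))
```

For this row the most frequent value "A" accounts for 2 of the 3 elements, so the score is 2/3.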
answered Dec 03 '25 23:12 by PaulS



In base R, you could do:

df$similarity <- round(apply(df[-1], 1, function(x) max(table(x))/length(x)), 2)

df
#>   group X1 X2 X3 similarity
#> 1    r1  A  A  D       0.67
#> 2    r2  B  C  A       0.33
#> 3    r3  C  M  C       0.67
#> 4    r4  K  K  K       1.00

Created on 2022-04-18 by the reprex package (v2.0.1)
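
If the score is needed in more than one place, the same idea could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original answer: the function name similarity_score and the vector cols are assumptions introduced here for illustration.

```r
# Data from the question
df <- data.frame(
  group = c("r1", "r2", "r3", "r4"),
  X1 = c("A", "B", "C", "K"),
  X2 = c("A", "C", "M", "K"),
  X3 = c("D", "A", "C", "K")
)

# Hypothetical helper: share of the most frequent value in a vector
similarity_score <- function(x) max(table(x)) / length(x)

# Select the score columns by name instead of df[-1]
cols <- c("X1", "X2", "X3")
scores <- apply(df[cols], 1, similarity_score)

df$similarity_score <- round(unname(scores), 2)
print(df)
```

Selecting the columns by name means the call can be re-run after the score column has been added, whereas df[-1] would then include the score column in the comparison.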

answered Dec 03 '25 21:12 by Allan Cameron



