Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

count distinct and calculate percent differences in two columns ifrom R dataframe

Tags:

r

dplyr

have a simple dataframe listing types of products and tag number per a couple of months:

data.frame(
  stringsAsFactors = FALSE,
                month = c("jan","jan","jan","jan",
                       "jan","feb","feb","feb","feb"),
                category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
                tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)

month category tag_number

jan      TB     101
jan      GT     101
jan      TB     223
jan      YT     223
jan      GT     223
feb      TB     345
feb      GT     345
feb      TB     655
feb      YT     223

I want to compare and extract the percentual difference between the unique tag_number values by month and category.

Let me explain a bit more as this seems a complicated issue.

If we get grouped month and categories we get to compare this table (month+category)

Jan  TB  101, 223    vs.    Feb  TB  345, 655
Jan  GT  101, 223    vs.    Feb  GT  345
Jan  YT  223         vs.    Feb  YT  223

In January the TB category has two unique tag numbers (101 and 223). If you compare with February there are two unique tags too, but none of them are equal so the percent difference between months is 100% and the count distinct is two in both cases.

Same for GT category. All tags are different. So again 100%.

Different case is YT. Both months contained the same tag number so the difference is 0%

Here expected result. The percent diff is the percentage of cases distinct between months.

Let's take TB category.

4 unique values total, 2 unique per month, none of them equal so:

4 / 4 = 1 (so 100%)

category   pct_diff
TB           100%
GT           100%
YT             0%
like image 875
Forge Avatar asked Nov 18 '25 19:11

Forge


2 Answers

df %>%
   group_by(category) %>%
   summarise(perc_diff = 100 * mean(table(tag_number) == 1))

# A tibble: 3 x 2
  category perc_diff
  <chr>        <dbl>
1 GT             100
2 TB             100
3 YT               0
like image 153
KU99 Avatar answered Nov 21 '25 08:11

KU99


Try this solution, but I am not sure whether it can be generalized to a larger case. The approach is to create a wide table then calculate difference percentage.

library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(
  stringsAsFactors = FALSE,
  month = c("jan","jan","jan","jan",
            "jan","feb","feb","feb","feb"),
  category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
  tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
 )

df2 <- df %>% 
  pivot_wider(names_from = "month", values_from = "tag_number") %>% 
  mutate(both_months = map2(jan, feb, ~intersect(.x,.y))) %>%  #find intersect
  mutate(all_lenght = map2(jan,feb, ~union(.x,.y))) %>%  #find all length
  mutate(percent_diff = map2(both_months, all_lenght, 
                             ~(100 - length(.x)*100/length(.y))) %>% 
                        unlist()) # calculate difference percentage

df2

#> # A tibble: 3 x 6
#>   category jan       feb       both_months all_lenght percent_diff
#>   <chr>    <list>    <list>    <list>      <list>            <dbl>
#> 1 TB       <int [2]> <int [2]> <int [0]>   <int [4]>           100
#> 2 GT       <int [2]> <int [1]> <int [0]>   <int [3]>           100
#> 3 YT       <int [1]> <int [1]> <int [1]>   <int [1]>             0
like image 29
Pete Kittinun Avatar answered Nov 21 '25 09:11

Pete Kittinun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!