count distinct and calculate percent differences in two columns ifrom R dataframe

Question

have a simple dataframe listing types of products and tag number per a couple of months:

data.frame(
  stringsAsFactors = FALSE,
                month = c("jan","jan","jan","jan",
                       "jan","feb","feb","feb","feb"),
                category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
                tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)

month category tag_number

jan      TB     101
jan      GT     101
jan      TB     223
jan      YT     223
jan      GT     223
feb      TB     345
feb      GT     345
feb      TB     655
feb      YT     223

I want to compare and extract the percentual difference between the unique tag_number values by month and category.

Let me explain a bit more as this seems a complicated issue.

If we get grouped month and categories we get to compare this table (month+category)

Jan  TB  101, 223    vs.    Feb  TB  345, 655
Jan  GT  101, 223    vs.    Feb  GT  345
Jan  YT  223         vs.    Feb  YT  223

In January the TB category has two unique tag numbers (101 and 223). If you compare with February there are two unique tags too, but none of them are equal so the percent difference between months is 100% and the count distinct is two in both cases.

Same for GT category. All tags are different. So again 100%.

Different case is YT. Both months contained the same tag number so the difference is 0%

Here expected result. The percent diff is the percentage of cases distinct between months.

Let's take TB category.

4 unique values total, 2 unique per month, none of them equal so:

4 / 4 = 1 (so 100%)

category   pct_diff
TB           100%
GT           100%
YT             0%

KU99 · Accepted Answer

df %>%
   group_by(category) %>%
   summarise(perc_diff = 100 * mean(table(tag_number) == 1))

# A tibble: 3 x 2
  category perc_diff
  <chr>        <dbl>
1 GT             100
2 TB             100
3 YT               0

Pete Kittinun · Answer

Try this solution, but I am not sure whether it can be generalized to a larger case. The approach is to create a wide table then calculate difference percentage.

library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(
  stringsAsFactors = FALSE,
  month = c("jan","jan","jan","jan",
            "jan","feb","feb","feb","feb"),
  category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
  tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
 )

df2 <- df %>% 
  pivot_wider(names_from = "month", values_from = "tag_number") %>% 
  mutate(both_months = map2(jan, feb, ~intersect(.x,.y))) %>%  #find intersect
  mutate(all_lenght = map2(jan,feb, ~union(.x,.y))) %>%  #find all length
  mutate(percent_diff = map2(both_months, all_lenght, 
                             ~(100 - length(.x)*100/length(.y))) %>% 
                        unlist()) # calculate difference percentage

df2

#> # A tibble: 3 x 6
#>   category jan       feb       both_months all_lenght percent_diff
#>   <chr>    <list>    <list>    <list>      <list>            <dbl>
#> 1 TB       <int [2]> <int [2]> <int [0]>   <int [4]>           100
#> 2 GT       <int [2]> <int [1]> <int [0]>   <int [3]>           100
#> 3 YT       <int [1]> <int [1]> <int [1]>   <int [1]>             0

count distinct and calculate percent differences in two columns ifrom R dataframe

Tags:

r

dplyr

Forge

2 Answers

KU99

Pete Kittinun

Recent Activity

Donate For Us

count distinct and calculate percent differences in two columns ifrom R dataframe

Tags:

r

dplyr

Forge

2 Answers

KU99

Pete Kittinun

Related questions

Recent Activity

Donate For Us