The number of resulting rows differs depending on whether I use distinct() or unique(). The data set I am working with is huge. I hope the code is clear enough to follow.
dt2a <- select(dt, mutation.genome.position, 
  mutation.cds, primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>% 
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>% distinct()
dim(dt2a)
[1] 2316382       5
## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds, 
   primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>% unique()
dim(dt2b)
[1] 2837982       5
This is the file I am working with:
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz
dt = fread(fl)
This appears to be a result of the group_by(). Consider this case:
library(dplyr)
dt <- data.frame(g = rep(c("a", "b"), each = 3),
    v = c(2, 2, 5, 2, 7, 7))
dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 b 2
dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
When you use distinct() without indicating which variables to make distinct, it appears to use the grouping variable.
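If the goal is to get the same rows from both functions, one possible workaround is to drop the grouping before deduplicating. This is only a minimal sketch on the toy data from above, and it assumes the grouping is no longer needed once the per-group column has been computed. The names res_ungrouped and res_unique are made up for illustration.

```r
library(dplyr)

# Same toy data as in the example above
dt <- data.frame(g = rep(c("a", "b"), each = 3),
                 v = c(2, 2, 5, 2, 7, 7))

# Remove the grouping first, so distinct() compares whole rows
# instead of falling back to the grouping variable
res_ungrouped <- dt %>%
  group_by(g) %>%
  ungroup() %>%
  distinct()

# Base R unique() on the plain data frame, for comparison
res_unique <- unique(dt)

# Both should now report the same number of rows
nrow(res_ungrouped)
nrow(res_unique)
```

In the original pipeline this would mean adding ungroup() after the mutate() step and before distinct().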