I am trying to draw a stratified sample from a data set for which a variable exists that indicates how large the sample size per group should be.
library(dplyr)
# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))
In this example, grp refers to the group I want to sample by and frq is the sample size specificied for that group. 
Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient :
s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>% 
      do.call(what = rbind)
Is there a way using just dplyr's group_by and sample_n to do this? 
My first thought was:
df %>% group_by(grp) %>% sample_n(size = frq)
but this gives the error:
Error in is_scalar_integerish(size) : object 'frq' not found
This works:
df %>% group_by(grp) %>% sample_n(frq[1])
# A tibble: 9 x 3
# Groups:   grp [3]
     id   grp   frq
  <int> <int> <dbl>
1     3     1     3
2     4     1     3
3     2     1     3
4     6     2     2
5     8     2     2
6    13     3     4
7    14     3     4
8    12     3     4
9    11     3     4
Not sure why it didn't work when you tried it.
library(tidyverse)
# example data 
df <- data.frame(id = 1:15,
                 grp = rep(1:3,each = 5), 
                 frq = rep(c(3,2,4), each = 5))
set.seed(22)
df %>%
  group_by(grp) %>%   # for each group
  nest() %>%          # nest data
  mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>%  # sample using id values and (unique) frq value
  unnest(v)           # unnest the sampled values
# # A tibble: 9 x 2
#     grp    id
#   <int> <int>
# 1     1     2
# 2     1     5
# 3     1     3
# 4     2     8
# 5     2     9
# 6     3    14
# 7     3    13
# 8     3    15
# 9     3    11
Function sample_n works if you pass as inputs a data frame of ids (not a vector of ids) and one frequency value (for each group).
An alternative version using map2 and generating the inputs for sample_n in advance:
df %>%
  group_by(grp) %>%                                 # for every group
  summarise(d = list(data.frame(id=id)),            # create a data frame of ids
            frq = unique(frq)) %>%                  # get the unique frq value
  mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>%   # sample using data frame of ids and frq value
  unnest(v) %>%                                     # unnest sampled values
  select(-frq)                                      # remove frq column (if needed)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With