Select sequences of values from groups at random

Question

Here is the data:

df <-
  data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4),
             value = LETTERS[1:20])

I need to randomly select sequences of four values from each group with dplyr. Selected values should be in the same order as in the data, and there should be no gaps between them.

The desired result may look like either of the following:

   group value
1      1     A
2      1     B
3      1     C
4      1     D
6      2     F
7      2     G
8      2     H
9      2     I
11     3     K
12     3     L
13     3     M
14     3     N
17     4     Q
18     4     R
19     4     S
20     4     T

   group value
1      1     A
2      1     B
3      1     C
4      1     D
5      2     E
6      2     F
7      2     G
8      2     H
10     3     J
11     3     K
12     3     L
13     3     M
17     4     Q
18     4     R
19     4     S
20     4     T

Here is my attempt so far:

library(dplyr)

set.seed(23)
df %>%
  group_by(group) %>%
  # flags each row independently, so the ones are not guaranteed to be consecutive
  mutate(selected = sample(0:1, size = n(), replace = TRUE)) %>%
  filter(selected == 1)

However, I couldn't figure out how to generate exactly 4 ones in a row, with zeroes before or after them.

r2evans · Accepted Answer

Within each group, we can sample a single start position from 1 to n() - 3 (the number of rows minus three) and add 0:3 to it to get the four consecutive row numbers to keep.

library(dplyr)

set.seed(42)
df %>%
  group_by(group) %>%
  # pick one random start per group, then keep that row and the next three
  filter(row_number() %in% c(sample(max(1, n()-3), size=1) + 0:3)) %>%
  ungroup()
# # A tibble: 16 × 2
#    group value
#    <dbl> <chr>
#  1     1 A    
#  2     1 B    
#  3     1 C    
#  4     1 D    
#  5     2 E    
#  6     2 F    
#  7     2 G    
#  8     2 H    
#  9     3 J    
# 10     3 K    
# 11     3 L    
# 12     3 M    
# 13     4 Q    
# 14     4 R    
# 15     4 S    
# 16     4 T

Safety steps here:

max(1, n()-3) makes sure that the sampled start position is never zero or negative, which is what n()-3 would give for a group with fewer than 4 rows.
If a group has fewer than 4 rows, this still works (selecting all of its rows), since row_number() %in% ... never tries to index rows that don't exist, even if c(sample(.) + 0:3) contains row numbers beyond the end of the group.
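
To make the index arithmetic concrete, here is a minimal base R sketch of the same idea for a single group. It builds the "four ones in a row" indicator the question asks about. The helper name window_indicator and its k argument are hypothetical, introduced only for illustration; they are not part of dplyr or of the answer above.

```r
# Hypothetical helper (an illustration, not part of dplyr): returns a logical
# vector of length n that is TRUE for one randomly placed run of up to k
# consecutive positions.
window_indicator <- function(n, k = 4) {
  # Valid start positions are 1 .. n - k + 1; max(1, ...) guards short groups.
  start <- sample(max(1, n - k + 1), size = 1)
  # Positions past the end of the group simply never match.
  seq_len(n) %in% (start + seq_len(k) - 1)
}

set.seed(42)
for (n in c(7, 4, 2)) {
  keep <- window_indicator(n)
  cat("n =", n, ":", as.integer(keep), "\n")
}
```

For n = 7 the run of ones can start at any of positions 1 to 4; for n = 4 and n = 2 every position is selected, which matches the two safety points above. Inside a grouped pipeline, something like filter(window_indicator(n())) should behave the same way as the one-liner in the answer, assuming the helper is defined first.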

ThomasIsCoding · Answer

You can also try embed(), although it is not as efficient as the answer by @r2evans:

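library(dplyr)  # the .by argument needs dplyr >= 1.1.0

df %>%
    filter(
        # one random row of embed() is one window of 4 consecutive values;
        # matching with %in% relies on values being unique within a group
        value %in% embed(value, 4)[sample.int(n() - 3, 1), ],
        .by = group
    )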

or

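library(dplyr)
library(tidyr)  # for unnest()

df %>%
    summarise(
        # embed() stores each window in reverse order, so 4:1 restores it
        value = list(embed(value, 4)[sample.int(n() - 3, 1), 4:1]),
        .by = group
    ) %>%
    unnest(value)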
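
To see why the second variant indexes the columns with 4:1, it may help to look at what embed() returns on a small vector. The sketch below uses a made-up five-letter vector x rather than the question's data, and also shows one caveat: unlike the accepted answer, embed() stops with an error when a group has fewer than 4 values.

```r
x <- LETTERS[1:5]

# Each row of embed(x, 4) is one window of 4 consecutive elements,
# stored in reverse order (latest element first):
#   row 1: D C B A
#   row 2: E D C B
m <- embed(x, 4)
print(m)

# Reversing the columns restores the original order: A B C D
print(m[1, 4:1])

# embed() raises an error when the vector is shorter than the window,
# so groups with fewer than 4 rows need separate handling.
short <- tryCatch(embed(LETTERS[1:3], 4), error = function(e) e)
print(inherits(short, "error"))
```

The number of rows of the embedded matrix is the group size minus three, which is why both variants draw the row index with sample.int(n() - 3, 1).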

Tags: random, r, dplyr

Polina B
