
Select sequences of values from groups at random

Tags: random, r, dplyr

Here is the data:

df <-
  data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4),
             value = LETTERS[1:20])

I need to randomly select a sequence of four consecutive values from each group with dplyr. The selected values should keep the same order as in the data, with no gaps between them.

The desired result may look like either of these:

   group value
1      1     A
2      1     B
3      1     C
4      1     D
6      2     F
7      2     G
8      2     H
9      2     I
11     3     K
12     3     L
13     3     M
14     3     N
17     4     Q
18     4     R
19     4     S
20     4     T

   group value
1      1     A
2      1     B
3      1     C
4      1     D
5      2     E
6      2     F
7      2     G
8      2     H
10     3     J
11     3     K
12     3     L
13     3     M
17     4     Q
18     4     R
19     4     S
20     4     T

Here is what I have so far:

library(dplyr)

set.seed(23)
df %>% 
  group_by(group) %>% 
  mutate(selected = sample(0:1, size = n(), replace = TRUE)) %>% 
  filter(selected == 1)

However, I couldn't figure out how to generate exactly 4 ones in a row, with zeroes before or after them.

Polina B asked Oct 16 '25 19:10



2 Answers

Within each group, we can sample a single starting row from 1 to n() - 3, then add 0:3 to it to get the four consecutive row numbers to keep.

set.seed(42)
df %>%
  group_by(group) %>%
  filter(row_number() %in% c(sample(max(1, n()-3), size=1) + 0:3)) %>%
  ungroup()
# # A tibble: 16 × 2
#    group value
#    <dbl> <chr>
#  1     1 A    
#  2     1 B    
#  3     1 C    
#  4     1 D    
#  5     2 E    
#  6     2 F    
#  7     2 G    
#  8     2 H    
#  9     3 J    
# 10     3 K    
# 11     3 L    
# 12     3 M    
# 13     4 Q    
# 14     4 R    
# 15     4 S    
# 16     4 T    

Two safeguards are built in:

  • max(1, n()-3) makes sure that we never try to sample a negative (or zero) row number.
  • If a group has fewer than 4 rows, this still works and selects all of its rows: row_number() %in% ... never indexes rows that don't exist, even when sample(.) + 0:3 lists more row numbers than the group has.
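The two safeguards can be checked in isolation. The following is a minimal base R sketch of the same index logic, not the dplyr pipeline itself; the helper name `window_rows()` and its `width` argument are hypothetical and only serve the illustration.

```r
# Hypothetical helper (not part of the answer above): returns a logical mask
# that keeps one random run of `width` consecutive positions out of `n`.
window_rows <- function(n, width = 4) {
  # The start position is drawn from 1..(n - width + 1); max() keeps the
  # upper bound at 1 when the group is shorter than `width`.
  start <- sample(max(1, n - width + 1), size = 1)
  # %in% only tests positions that exist, so a window that runs past `n`
  # is harmless.
  seq_len(n) %in% (start + seq_len(width) - 1)
}

set.seed(42)
window_rows(7)  # a group of 7 rows: exactly four consecutive TRUEs
window_rows(2)  # a group of 2 rows: both rows are kept
```

Inside a grouped pipeline, `row_number() %in% ...` plays the role of `seq_len(n) %in% ...` here.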
r2evans answered Oct 18 '25 10:10



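You can also try embed(), although this is less efficient than the answer by @r2evans. Note that .by requires dplyr 1.1.0 or later, that embed() throws an error for a group with fewer than four rows, and that the first variant matches on value, so it assumes values are unique within a group.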

df %>%
    filter(
        value %in% embed(value, 4)[sample.int(n() - 3, 1), ],
        .by = group
    )

or, reversing the columns with 4:1 to restore the original order (unnest() comes from tidyr):

df %>%
    summarise(
        value = list(embed(value, 4)[sample.int(n() - 3, 1), 4:1]),
        .by = group
    ) %>%
    unnest(value)
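To see why the column order matters, it helps to look at what `embed()` returns. This is a small base R illustration using the values of group 2 from the question; the variable names `x`, `windows`, and `short` are chosen here for the example only.

```r
x <- LETTERS[5:9]        # group 2 from the question: "E" "F" "G" "H" "I"

# One row per window of four consecutive values; within each row the
# values appear in reverse order (latest first).
windows <- embed(x, 4)
windows
#      [,1] [,2] [,3] [,4]
# [1,] "H"  "G"  "F"  "E"
# [2,] "I"  "H"  "G"  "F"

# Reversing the columns restores the original order of the data.
windows[1, 4:1]          # "E" "F" "G" "H"

# A group with fewer than four values cannot be embedded at all.
short <- try(embed(LETTERS[1:3], 4), silent = TRUE)
inherits(short, "try-error")  # TRUE
```

The number of rows in the matrix is n - 3, which is why both variants sample the row index with `sample.int(n() - 3, 1)`.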
ThomasIsCoding answered Oct 18 '25 09:10
