Here is the data:
df <- data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4),
                 value = LETTERS[1:20])
I need to randomly select sequences of four values from each group with dplyr. Selected values should be in the same order as in the data, and there should be no gaps between them.
Desired result may look like this:
group value
1 1 A
2 1 B
3 1 C
4 1 D
6 2 F
7 2 G
8 2 H
9 2 I
11 3 K
12 3 L
13 3 M
14 3 N
17 4 Q
18 4 R
19 4 S
20 4 T
or like this:
group value
1 1 A
2 1 B
3 1 C
4 1 D
5 2 E
6 2 F
7 2 G
8 2 H
10 3 J
11 3 K
12 3 L
13 3 M
17 4 Q
18 4 R
19 4 S
20 4 T
This is where I am in solving this:
library(dplyr)

set.seed(23)
df %>%
  group_by(group) %>%
  mutate(selected = sample(0:1, size = n(), replace = TRUE)) %>%
  filter(selected == 1)
However, I couldn't figure out how to generate exactly 4 ones in a row, with zeroes before or after them.
We can sample a single starting row between 1 and the number of rows in the group minus three, then add 0:3 to it to select which rows we retain.
set.seed(42)
df %>%
  group_by(group) %>%
  filter(row_number() %in% c(sample(max(1, n()-3), size=1) + 0:3)) %>%
  ungroup()
# # A tibble: 16 × 2
# group value
# <dbl> <chr>
# 1 1 A
# 2 1 B
# 3 1 C
# 4 1 D
# 5 2 E
# 6 2 F
# 7 2 G
# 8 2 H
# 9 3 J
# 10 3 K
# 11 3 L
# 12 3 M
# 13 4 Q
# 14 4 R
# 15 4 S
# 16 4 T
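To see how the pipeline above behaves when a group has fewer than four rows, here is a small sketch. The data frame df_short and the result name res_short are hypothetical and not part of the question; the sketch assumes dplyr is installed. The short group should simply come back whole rather than raise an error.

```r
library(dplyr)

# Hypothetical data: group 1 has only two rows, group 2 has six.
df_short <- data.frame(
  group = c(1, 1, 2, 2, 2, 2, 2, 2),
  value = LETTERS[1:8]
)

set.seed(42)
res_short <- df_short %>%
  group_by(group) %>%
  # For group 1, n() - 3 is -1, so max() clamps the upper bound to 1 and
  # the candidate rows are 1:4; %in% keeps only the rows that exist.
  filter(row_number() %in% (sample(max(1, n() - 3), size = 1) + 0:3)) %>%
  ungroup()

# Rows kept per group: two for the short group, four for the other.
count(res_short, group)
```

If short groups should be dropped instead of returned in full, a filter(n() >= 4) step before the sampling would be one way to do that.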
Safety steps here:
- max(1, n()-3) makes sure that we don't attempt to sample negative (or zero) row numbers.
- row_number() %in% ... will never try to index rows that don't exist, even if c(sample(.) + 0:3) might suggest more rows than exist.

You can try a bit with embed (but not as efficient as the answer by @r2evans):
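# .by needs dplyr >= 1.1.0; this assumes every group has at least four rows
# and that values are unique within a group (because of the %in% match)
df %>%
  filter(
    value %in% embed(value, 4)[sample.int(n() - 3, 1), ],
    .by = group
  )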
or
library(tidyr)  # for unnest()

df %>%
  summarise(
    # embed() stores each window in reverse order, so 4:1 restores it
    value = list(embed(value, 4)[sample.int(n() - 3, 1), 4:1]),
    .by = group
  ) %>%
  unnest(value)
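For readers unfamiliar with embed() (from the stats package that ships with R), this small sketch shows what it returns on a toy vector: each row is one window of consecutive elements, but the columns run from last to first, which is why the second snippet indexes with 4:1. The helper pick_run and its fallback for short vectors are my own assumptions for illustration, not part of the answer above.

```r
x <- LETTERS[1:6]

# Each row of the matrix is one window of four consecutive elements,
# stored in reverse order (last element of the window first).
windows <- embed(x, 4)
windows[1, ]      # "D" "C" "B" "A"
windows[1, 4:1]   # "A" "B" "C" "D"

# Hypothetical helper: pick one random run of k consecutive elements,
# returning the whole vector when it has k elements or fewer.
pick_run <- function(v, k = 4) {
  if (length(v) <= k) return(v)
  w <- embed(v, k)
  w[sample.int(nrow(w), 1), k:1]
}

set.seed(1)
pick_run(x)
```

Returning the vector unchanged for short inputs mirrors the max(1, n()-3) guard in the first answer; whether that is the right behavior depends on how short groups should be treated.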