I am working in R and face the following combinatorial problem. The starting point is a data frame with 512 rows containing all possible ordered triples of the digits 1 to 8:
expand.grid(rep(list(1:8), 3))
Now I would like to sample 420 rows from this data frame so that the frequency of each digit in each column is as similar as possible.
A purely random sample can be drawn as follows, but, depending on chance, the digit frequencies fluctuate considerably.
library(dplyr)
expand.grid(rep(list(1:8), 3)) %>%
  filter(row_number() %in% sample(1:nrow(.), 420))
Is there some kind of constraint I can impose to obtain frequencies that are as equal as possible?
Edit: However, the result doesn't have to be random. Is there a way to filter 420 rows with maximally equal frequencies?
Stratified Sampling
Note that expand.grid builds the data frame so that the first variable varies fastest and the last slowest ... so use stratified sampling: divide the rows into 8*8 = 64 groups (strata) of 8 consecutive rows each, and sample 6 or 7 rows from each, since
420/64
[1] 6.5625
R code for this follows:
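set.seed(7 * 11 * 13)
G <- expand.grid(rep(list(1:8), 3))
# Each row of M holds the indices of 8 consecutive rows of G, i.e. one stratum (fixed Var2 and Var3)
M <- matrix(1:512, 64, 8, byrow=TRUE)
# Draw 6 or 7 row indices from each stratum, each size chosen with probability 1/2
rows <- apply(M, 1, \(x) sample(x, ifelse(runif(1) <= 0.5, 6, 7))) |> unlist()
m <- length(rows)
# The total m is random (416 on average), so top up from the unused rows to reach 420
stopifnot(m <= 420)  # sample() below would fail if the strata already yielded more than 420 rows
DIFF <- setdiff(1:512, rows)
morerows <- sample(DIFF, 420 - m)
rows <- c(rows, morerows)
GG <- G[rows, ]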
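As a side note, the random stratum sizes above do not guarantee a total of exactly 420, which is why the top-up step is needed. A possible variant (a minimal sketch, not part of the original answer) fixes the sizes in advance: 36 strata get 7 rows and 28 strata get 6 rows, since 36*7 + 28*6 = 420. The names G_all, strata, sizes, rows_exact and GG_exact are illustrative and chosen so they do not clash with the code above; the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible
G_all <- expand.grid(rep(list(1:8), 3))
# One stratum per matrix row: 8 consecutive row indices of G_all (fixed Var2 and Var3)
strata <- matrix(1:512, 64, 8, byrow = TRUE)
# Assign sizes in random order: 36 strata of size 7 and 28 of size 6 (36*7 + 28*6 = 420)
sizes <- sample(rep(c(7, 6), times = c(36, 28)))
# Sample the assigned number of row indices from each stratum
rows_exact <- unlist(lapply(1:64, \(i) sample(strata[i, ], sizes[i])))
GG_exact <- G_all[rows_exact, ]
nrow(GG_exact)  # 420
```

This always returns exactly 420 distinct rows, but the per-digit frequencies still vary by chance, much as in the tables produced by the original code.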
Then looking at frequency tables for each variable:
lapply(GG, table)
$Var1
 1  2  3  4  5  6  7  8
55 49 53 52 50 54 51 56
$Var2
 1  2  3  4  5  6  7  8
51 54 53 54 51 51 52 54
$Var3
 1  2  3  4  5  6  7  8
53 53 50 54 54 54 50 52
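For the non-random case raised in the question's edit, the best achievable result is 52 or 53 occurrences of each digit per column, because 420/8 = 52.5. One way to reach that bound deterministically is sketched below. This is an illustrative construction, not taken from the original answer, and the names G_all, s, d, keep and GG_det are assumptions. It relies on the fact that the rows whose digit sum has a given remainder modulo 8 form a perfectly balanced block of 64 rows (every digit 8 times in every column). Six such blocks give 384 rows, and the remaining 36 rows are picked from a seventh block using the offset between the first two digits.

```r
G_all <- expand.grid(rep(list(1:8), 3))
# Remainder of the digit sum: each of the 8 classes has 64 rows,
# with every digit appearing exactly 8 times in every column
s <- (G_all$Var1 + G_all$Var2 + G_all$Var3) %% 8
# Offset between the first two digits, used to pick a balanced part of one class
d <- (G_all$Var2 - G_all$Var1) %% 8
# Keep 6 full classes (6 * 64 = 384 rows) plus 36 rows of the class s == 6:
# offsets 0 to 3 give 32 rows, offset 4 with Var1 <= 4 gives 4 more
keep <- s %in% 0:5 | (s == 6 & (d <= 3 | (d == 4 & G_all$Var1 <= 4)))
GG_det <- G_all[keep, ]
nrow(GG_det)  # 420
lapply(GG_det, table)
```

In this selection every digit occurs 52 or 53 times in each of the three columns, so no other choice of 420 rows can be more evenly balanced column by column. The price is that the selection is highly regular rather than random.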