
Selecting a sample to match the distribution of variables in another dataset

Let x be a dataset with 5 variables and 15 observations:

age  gender  height  weight  fitness
17   M       5.34    68      medium
23   F       5.58    55      medium
25   M       5.96    64      high
25   M       5.25    60      medium
18   M       5.57    60      low
17   F       5.74    61      low
17   M       5.96    71      medium
22   F       5.56    75      high
16   F       5.02    56      medium
21   F       5.18    63      low
20   M       5.24    57      medium
15   F       5.47    72      medium
16   M       5.47    61      high
22   F       5.88    73      low
18   F       5.73    62      medium

The frequencies of the values for the fitness variable are as follows: low = 4, medium = 8, high = 3.

Suppose I have another dataset y with the same 5 variables but 100 observations. The frequencies of the values for the fitness variable in this dataset are as follows: low = 42, medium = 45, high = 13.

Using R, how can I obtain a representative sample from y such that the distribution of fitness in the sample closely matches the distribution of fitness in x?

My initial idea was to use the sample function in R and assign weighted probabilities through the prob argument. However, using probabilities would force an exact match for the frequency distribution. My objective is to get a close enough match while maximizing the sample size.

Additionally, how would I add a further constraint so that the distribution of gender also closely matches that of x?

asked Oct 23 '25 15:10 by Outlier


2 Answers

The smallest frequency in your y is 13, for the "high" fitness level, so you can't sample more than 13 "high" records. That's your first constraint. You want to maximize your sample size, so you take all 13. In x, "high" makes up 3 of 15 observations (20%), so to match the proportions in x, 13 should be 20% of your total, which means your total must be 65 (13/0.2). The other frequencies must therefore be 17 (low: 4/15 of 65 is about 17.3) and 35 (medium: 8/15 of 65 is about 34.7). Since y has enough records at these fitness levels (42 low and 45 medium), you can take this as your sample. If either of these frequencies exceeded the number available in y, you'd have another constraint and would have to adjust the total accordingly.
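The arithmetic above can be sketched in a few lines of R. This is only an illustration: the names `x_counts`, `y_counts`, `n_max` and `target` are made up here, and because the sizes are rounded to whole rows, the sample proportions match x only approximately.

```r
# Fitness counts from the question (x is the reference, y is the pool to sample from)
x_counts <- c(low = 4, medium = 8, high = 3)
y_counts <- c(low = 42, medium = 45, high = 13)

# Largest total n for which every level still has enough rows in y:
# n * x_counts[k] / sum(x_counts) <= y_counts[k] for every level k
n_max <- floor(min(y_counts * sum(x_counts) / x_counts))

# Target proportions taken from x
p <- x_counts / sum(x_counts)

# Per-level sample sizes, rounded to whole rows and capped at what y holds
target <- pmin(round(n_max * p), y_counts)

n_max   # 65
target  # low 17, medium 35, high 13
```

Taking the minimum over all levels, rather than only looking at the rarest level in y, also covers the case where a different level turns out to be the binding constraint.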

For sampling, you'd first select all records with "high" fitness (sampling with certainty). Next, sample from the other levels separately (stratified random sampling). Finally, combine all three.

Example:

# set up the data (your "y"); the seed only makes the example reproducible
set.seed(1)
df <- data.frame(age=round(rnorm(100, 20, 5)), 
                 gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                 height=round(rnorm(100, 12, 3)), 
                 fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                                levels=c("low","medium","high")))

Create subsets for sampling:

fit.low <- subset(df, subset=fitness=="low")
fit.medium <- subset(df, subset=fitness=="medium")
fit.high <- subset(df, subset=fitness=="high")

Sample 17 from the low fitness group (40.5% of that group, or 26.2% of the total sample).

fit.low_sam <- fit.low[sample(1:42, 17),]

Sample 35 from the medium fitness group (77.8% of that group, or 53.8% of the total sample).

fit.med_sam <- fit.medium[sample(1:45, 35),]

Combine them all.

fit.sam <- rbind(fit.low_sam, fit.med_sam, fit.high)

I tried to do this using the sample_n and sample_frac functions from dplyr, but I don't think these functions allow stratified sampling with a different size or proportion for each group.

library(dplyr)
df %>%
  group_by(fitness) %>%
  sample_n(size=c(17,35,13), weight=c(0.27, 0.53, 0.2))
# Error

But the sampling package can certainly do this (see also "Stratified random sampling from data frame").

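library(sampling)
# sizes are given in the order the strata appear in df (low, medium, high);
# "srswor" = simple random sampling without replacement
s <- strata(df, "fitness", size=c(17,35,13), method="srswor")
getdata(df, s)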
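If installing the sampling package is not an option, the same kind of stratified draw can probably be done in base R with split and Map. The following is a minimal sketch rather than part of the approach above: the `id` column, the `sizes` vector and the seed are assumptions added for illustration, and the data frame is a stand-in that keeps only the fitness column.

```r
set.seed(42)  # assumption: fixed seed only to make the sketch reproducible

# Stand-in for y: only the fitness column matters for this sketch
df <- data.frame(
  id = 1:100,
  fitness = factor(rep(c("low", "medium", "high"), times = c(42, 45, 13)),
                   levels = c("low", "medium", "high"))
)

# Rows to draw per level (the sizes derived earlier in this answer)
sizes <- c(low = 17, medium = 35, high = 13)

# Split by level, draw the requested number of rows without replacement, recombine
pieces <- Map(
  function(d, n) d[sample.int(nrow(d), n), , drop = FALSE],
  split(df, df$fitness),
  sizes[levels(df$fitness)]
)
fit.sam <- do.call(rbind, pieces)

table(fit.sam$fitness)
```

Indexing `sizes` by `levels(df$fitness)` keeps each size paired with the right subset, whatever order the sizes were typed in.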
answered Oct 26 '25 05:10 by Edward


Consider using rmultinom to draw the sample counts for each level of fitness.

Prepare the data (I have reused the y set-up from @Edward's answer):

x <- read.table(text = "age gender  height  weight  fitness
17  M   5.34    68  medium
23  F   5.58    55  medium
25  M   5.96    64  high
25  M   5.25    60  medium
18  M   5.57    60  low
17  F   5.74    61  low
17  M   5.96    71  medium
22  F   5.56    75  high
16  F   5.02    56  medium
21  F   5.18    63  low
20  M   5.24    57  medium
15  F   5.47    72  medium
16  M   5.47    61  high
22  F   5.88    73  low
18  F   5.73    62  medium", header = TRUE)

y <- data.frame(age=round(rnorm(100, 20, 5)), 
                gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                height=round(rnorm(100, 12, 3)), 
                fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                               levels=c("low","medium","high")))

Now the sampling procedure. Update: I have changed the code to handle the two-variable case (gender and fitness).

library(tidyverse)

N_SAMPLES <- 100

# Calculate the relative frequency of each fitness/gender cell in x
freq <- x %>%
    group_by(fitness, gender) %>% # You can set any combination of factors
    summarise(freq = n() / nrow(x), .groups = "drop")

# Draw N_SAMPLES single multinomial trials using the cell probabilities from x
distr <- rmultinom(N_SAMPLES, 1, freq$freq)
# Convert to counts per cell
freq$counts <- rowSums(distr)

# Join y with the frequencies for further use in sampling
y_count <- y %>% left_join(freq, by = c("fitness", "gender"))

# Perform sampling using the multinomial counts
y_sampled <- y_count %>%
    group_by(fitness, gender) %>% # Should be the same as in the frequency calculation
    # Never ask for more rows than the cell actually holds
    sample_n(size = min(first(counts), n()),
        replace = FALSE) %>%
    select(-freq, -counts)
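Because the multinomial counts are random and are capped by what each cell of y holds, the sample will generally not reproduce the proportions in x exactly. One way to see how close a given sample gets is to compare proportion tables. The helper `compare_props` below is hypothetical (it is not part of dplyr or base R), and the demo uses plain vectors built from the counts in the question rather than the data frames above.

```r
# Hypothetical helper: proportion of each level in a reference and in a sample,
# plus the largest absolute difference between the two
compare_props <- function(ref, sam) {
  lv <- union(unique(as.character(ref)), unique(as.character(sam)))
  p_ref <- prop.table(table(factor(ref, levels = lv)))
  p_sam <- prop.table(table(factor(sam, levels = lv)))
  list(
    table = data.frame(level = lv,
                       reference = as.numeric(p_ref),
                       sample = as.numeric(p_sam)),
    max_abs_diff = max(abs(as.numeric(p_ref) - as.numeric(p_sam)))
  )
}

# Demo: fitness counts of x from the question versus a 17/35/13 sample
x_fitness   <- rep(c("low", "medium", "high"), times = c(4, 8, 3))
sam_fitness <- rep(c("low", "medium", "high"), times = c(17, 35, 13))

res <- compare_props(x_fitness, sam_fitness)
res$table
res$max_abs_diff
```

For the two-variable case, the same helper could be fed `interaction(x$fitness, x$gender)` and the corresponding interaction built from `y_sampled`, so that each fitness/gender cell is compared as one level.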
answered Oct 26 '25 07:10 by Istrel