
Make sampling run faster

I need to extract 2 million observations from a data set of 23 million. With the code below this takes a very long time: on a Xeon CPU with 16 GB RAM it is still running after 12 hours. I also noticed that the CPU is running at only 25% and the hard disk at 43%. How can I make the sampling run faster? These are the two lines of code I'm using:

prb <- ifelse(dat$target=='1', 1.0, 0.05)
smpl <- dat[sample(nrow(dat), 2000000, prob = prb), ]
mql4beginner asked Feb 26 '26 18:02


1 Answer

Calling the sample function with unequal probabilities and replace = FALSE probably doesn't do exactly what you want: it draws one element, rescales the remaining probabilities so that they add up to one, draws the next element, and so on. This makes it slow, and the resulting probabilities no longer match the original ones.

One solution in your case would be to split your data set in two (target == '1' and target != '1') and draw a separate sample from each. You would only have to work out how many elements to select in each group.
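
A minimal sketch of this split approach, using base R only. The toy data frame, its size, and the names w1, w0, idx1, idx0, n1 and n0 are assumptions for illustration; the group sizes are chosen so that the sampling rate in the target == '1' group is about 20 times the rate in the other group, matching the 1.0 versus 0.05 weights in the question.

```r
set.seed(42)

# Hypothetical toy data standing in for `dat` (size and columns are assumptions)
dat <- data.frame(
  id = 1:20000,
  target = sample(c("0", "1"), 20000, replace = TRUE, prob = c(0.9, 0.1))
)

nsample <- 1000   # total number of rows to draw
w1 <- 1.0         # weight for target == '1' (from the question)
w0 <- 0.05        # weight for target != '1' (from the question)

# Row indices of the two groups
idx1 <- which(dat$target == "1")
idx0 <- which(dat$target != "1")

# Allocate the sample so that the per-group sampling rates are in ratio w1 : w0
n1 <- round(nsample * w1 * length(idx1) /
              (w1 * length(idx1) + w0 * length(idx0)))
n1 <- min(n1, length(idx1))   # cannot take more rows than the group has
n0 <- nsample - n1
stopifnot(n0 <= length(idx0))

# Simple random sampling without replacement inside each group
rows <- c(idx1[sample.int(length(idx1), n1)],
          idx0[sample.int(length(idx0), n0)])
smpl <- dat[rows, ]
```

Within each group all rows are equally likely, so sample never has to rescale unequal probabilities after every draw, which is what makes the original call slow.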

Another solution is to use the sampling methods from the sampling package. For example, systematic sampling:

library(sampling)

nsample <- 2E6

# Scale probabilities: add up to the number of elements we want
prb <- nsample/sum(prb) * prb

# Sample
smpl <- UPrandomsystematic(prb)

This takes approx 3 seconds on my system.

Checking the output:

> t <- table(smpl, prb)
> sum(smpl)
[1] 2e+06
> t[2,2]/t[2,1]
[1] 19.96854

We have indeed selected 2E6 records, and the inclusion probability for target != 1 is 20 times smaller than for target == 1.
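
To show what random systematic sampling does with scaled inclusion probabilities, here is a rough base-R sketch. It is not the sampling package's implementation; the function random_systematic, the toy data, and the names pik and smpl_rows are hypothetical. The sketch also shows the last step of turning a 0/1 selection vector into actual rows.

```r
set.seed(1)

# Hypothetical toy data standing in for `dat` (size and columns are assumptions)
dat <- data.frame(
  id = 1:10000,
  target = sample(c("0", "1"), 10000, replace = TRUE, prob = c(0.9, 0.1))
)
prb <- ifelse(dat$target == "1", 1.0, 0.05)

nsample <- 500

# Scale so the inclusion probabilities add up to the sample size
pik <- nsample / sum(prb) * prb
stopifnot(all(pik <= 1))   # the sketch assumes no probability exceeds 1

# Illustrative re-implementation of the idea, not the package's code
random_systematic <- function(pik) {
  n <- length(pik)
  ord <- sample.int(n)            # put the units in random order
  upper <- cumsum(pik[ord])       # each unit owns the interval (lower, upper]
  lower <- c(0, upper[-n])
  u <- runif(1)                   # random start; grid points are u, u + 1, u + 2, ...
  hit <- floor(upper - u) > floor(lower - u)   # TRUE if a grid point falls in the interval
  selected <- integer(n)
  selected[ord[hit]] <- 1L
  selected                        # 0/1 indicator in the original row order
}

smpl <- random_systematic(pik)

# Turn the 0/1 indicator into the sampled rows
smpl_rows <- dat[smpl == 1, ]
```

Because this is a single pass over a cumulative sum, the cost grows linearly with the number of rows, unlike the draw-and-rescale loop that sample performs.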

Jan van der Laan answered Mar 01 '26 10:03