I need to extract 2 million observations from a 23-million-row data set. With the code below this takes a very long time: on a Xeon CPU with 16 GB RAM it is still running after 12 hours, with the CPU at only 25% and the disk at 43%. How can I make the sampling run faster? These are the two lines of code I'm using:
prb <- ifelse(dat$target=='1', 1.0, 0.05)
smpl <- dat[sample(nrow(dat), 2000000, prob = prb), ]
The sample function, called with unequal probabilities and with replace = FALSE, probably doesn't do exactly what you want it to do: it draws one element, recalculates the remaining probabilities so that they add up to one, draws the next element, and so on. This makes it slow, and the realised inclusion probabilities no longer match the ones you supplied.
One solution in your case is to split your data set in two (target == '1' and target != '1') and draw a simple random sample from each part. You only have to work out how many elements to select from each group.
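A minimal sketch of that split-and-sample approach, using toy data (the group sizes and `n_total` below are illustrative, not taken from the question):

```r
# Toy data: 900 rows with target '0', 100 rows with target '1'
dat <- data.frame(target = rep(c('0', '1'), c(900, 100)),
                  stringsAsFactors = FALSE)

n_total <- 100                    # total rows wanted (2E6 in the question)
idx1 <- which(dat$target == '1')  # rows with weight 1.0
idx0 <- which(dat$target != '1')  # rows with weight 0.05

# Allocate the total in proportion to each group's total weight,
# so the inclusion probabilities keep the intended 20:1 ratio
w1 <- 1.00 * length(idx1)
w0 <- 0.05 * length(idx0)
n1 <- round(n_total * w1 / (w1 + w0))
n0 <- n_total - n1

# Equal-probability sampling within each group is fast
set.seed(1)
smpl <- dat[c(sample(idx1, n1), sample(idx0, n0)), , drop = FALSE]
```

Within each group `sample()` runs with equal probabilities, so the slow sequential weighted draw is avoided entirely.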
Another solution is to use the sampling methods from the sampling package. For example, systematic sampling:
library(sampling)
nsample <- 2E6
# Scale the probabilities (prb as defined in the question) so that
# they add up to the number of elements we want to select
prb <- nsample/sum(prb) * prb
# Sample; UPrandomsystematic returns a 0/1 inclusion indicator,
# so the selected rows are dat[smpl == 1, ]
smpl <- UPrandomsystematic(prb)
This takes approximately 3 seconds on my system.
Checking the output:
> t <- table(smpl, prb)
> sum(smpl)
[1] 2e+06
> t[2,2]/t[2,1]
[1] 19.96854
We indeed have 2E6 records selected, and the inclusion probability for target == '1' is 20 times larger than for target != '1', matching the original weights of 1.0 and 0.05.