Is there a performant way to sample files from a file system until you hit a target sample size in Python?
For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.
Currently, for small-ish flat directories of ~100k files, I can do something like this:
import os
import random
sample_size = 20_000
# Materialize every directory entry in memory, then sample from the list
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)
However, this doesn't scale up well. So I thought maybe to put the random check inside the loop. This sort of works, but has the problem that if the number of files in the directory is close to sample_size, it may not pick up the full target sample_size; I would need to keep track of which files were already included in the sample and then keep looping until I fill up the sample bucket.
import os
import random
sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    # Randomly skip roughly half of the entries (values 0-4 out of 0-10)
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break
Any ideas on how to randomly sample a large selection of files from a large directory structure?
Use iterators/generators so you don't keep all the file names in memory, and use reservoir sampling to pick the samples from what is essentially a stream of file names.
Code
from pathlib import Path
import random

# glob returns a lazy generator, so the full file list is never held in memory
pathlist = Path("C:/Users/XXX/Documents").glob('**/*.py')
nof_samples = 10
rc = []
for k, path in enumerate(pathlist):
    if k < nof_samples:
        rc.append(str(path))  # fill the reservoir first; path is a Path object, not a string
    else:
        # Replace an existing entry with probability nof_samples / (k + 1)
        i = random.randint(0, k)
        if i < nof_samples:
            rc[i] = str(path)
print(len(rc))
print(rc)
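If you want every regular file under a nested root rather than only .py files, the same reservoir loop can be fed by a recursive os.scandir generator instead of glob. Below is a minimal sketch under that assumption; the iter_files helper and the root/sample_size values are illustrative, not part of the original answer.
import os
import random

def iter_files(root):
    # Yield the path of every regular file under root, depth-first,
    # without building a full list in memory.
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path

sample_size = 20_000
reservoir = []
for k, filepath in enumerate(iter_files("C:/Users/XXX/Documents")):
    if k < sample_size:
        reservoir.append(filepath)      # fill the reservoir first
    else:
        i = random.randint(0, k)        # uniform index over the k + 1 items seen so far
        if i < sample_size:
            reservoir[i] = filepath     # replace with decreasing probability

print(len(reservoir))
Memory use stays proportional to sample_size regardless of how many files exist, and each file is visited exactly once.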