Is there a performant way to sample files from a file system until you hit a target sample size in Python?
For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.
Currently, for small-ish flat directories of ~100k files or so, I can do something like this:
import os
import random

sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)
However, this doesn't scale up well. So I thought I'd put the random check inside the loop instead. This sort of works, but it has a problem: if the number of files in the directory is close to sample_size, it may not pick up the full target sample_size, and I would need to keep track of which files were already included in the sample and then keep looping until I fill up the sample bucket.
import os
import random

sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break
Any ideas on how to randomly sample a large selection of files from a large directory structure?
Use iterators/generators so you don't keep all the file names in memory, and use reservoir sampling to pick the samples from what is essentially a stream of file names.
Code
from pathlib import Path
import random

pathlist = Path("C:/Users/XXX/Documents").glob('**/*.py')
nof_samples = 10
rc = []
for k, path in enumerate(pathlist):
    if k < nof_samples:
        rc.append(str(path))  # path is a Path object, not a string
    else:
        i = random.randint(0, k)
        if i < nof_samples:
            rc[i] = str(path)

print(len(rc))
print(rc)
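If you want something reusable for a deep tree like the 10-million-file case in the question, here is a minimal sketch of the same reservoir sampling wrapped in a hypothetical sample_paths() helper. It walks the tree with os.walk (which is built on scandir), so only the reservoir itself is ever held in memory; the function name and the target size are just illustrative.

import os
import random

def sample_paths(root, sample_size):
    # Hypothetical helper: single pass over the tree, reservoir sampling
    # (Algorithm R) over the stream of file paths produced by os.walk.
    reservoir = []
    k = 0  # index of the current file in the stream
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if k < sample_size:
                # Fill the reservoir with the first sample_size paths.
                reservoir.append(path)
            else:
                # Keep the k-th path with probability sample_size / (k + 1).
                i = random.randint(0, k)
                if i < sample_size:
                    reservoir[i] = path
            k += 1
    return reservoir

sample = sample_paths("C:/Users/XXX/Documents", 20_000)
print(len(sample))

If the tree holds fewer than sample_size files, the function simply returns all of them, which avoids the "keep looping until the bucket is full" problem from the question.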