Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to randomly sample files from a filesystem in Python

Is there a performant way to sample files from a file system until you hit a target sample size in Python?

For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.

Currently, for small-ish flat directories of ~100k or so, I can do something like this:

import os
import random
sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)

However, this doesn't scale up well. So, I thought maybe put the random check in the loop. This sort of works, but has the problem of if the number of files in the directory is close the sample_size, it may not pick up the full target sample_size and I would need to keep track of which files were included in the sample and then keep looping until I fill up the sample bucket.

import os
import random
sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break

Any ideas on how to randomly sample a large selection of files from a large directory structure?

like image 801
lifebythedrop Avatar asked Sep 10 '25 09:09

lifebythedrop


1 Answers

Use iterators/generators so you won't keep all files in memory. And use Reservoir sampling to pick selected samples from the basically a stream of file names.

Code

from pathlib import Path
import random

pathlist = Path("C:/Users/XXX/Documents").glob('**/*.py')
nof_samples = 10

rc = []
for k, path in enumerate(pathlist):
    if k < nof_samples:
        rc.append(str(path)) # because path is object not string
    else:
        i = random.randint(0, k)
        if i < nof_samples:
            rc[i] = str(path)

print(len(rc))
print(rc)
like image 79
Severin Pappadeux Avatar answered Sep 12 '25 22:09

Severin Pappadeux