
Python out-of-memory error, multithreaded subdirectory recursion

I'm experimenting with recursion for the first time and am running into problems when scanning large directories. The following code takes a list of glob patterns (e.g., ['/opt/data/large_dir_1*', '/opt/data/large_dir_2*']), expands them, and passes the resulting lists of files/directories to threads, totalling along the way the number of directories found, the number of files found, and the total byte size of those files. The subdirectories are large (several of them contain hundreds of thousands of directories and millions of files), but I'm surprised that some of the threads I've spawned are throwing "MemoryError" exceptions.

My guess is that the problems are occurring because the 'dirList' and 'fileList' variables are taking up too much memory. My backup plan is to have the recursive function write its data to log files rather than return it, but I'm trying to avoid 'global' variables as much as possible. Does anyone have any thoughts on a better way to proceed? Am I doing something stupid here? Thanks for any help you can provide.

import os

def Scan(masterFileList, baseDir='/'):
    dirList = []
    fileList = []
    byteCount = 0
    for fileOrDir in masterFileList:
        fullName = os.path.join(baseDir, fileOrDir)
        if os.path.isdir(fullName):
            dirList.append(fullName)
            # recursion: call Scan() on the subdirectory's contents:
            dirs, files, nbytes = Scan(os.listdir(fullName), fullName)
            dirList.extend(dirs)
            fileList.extend(files)
            byteCount += nbytes
        elif os.path.isfile(fullName):
            fileList.append(fullName)
            byteCount += os.path.getsize(fullName)
    return dirList, fileList, byteCount


import glob
import threading

dirList = []
fileList = []
byteCount = 0
errorList = []

def doScan(dataQueue):
    global byteCount  # byteCount is rebound below, so it must be declared global
    print('Thread starting')

    while not dataQueue.empty():
        globPattern = dataQueue.get()
        globbed = glob.glob(globPattern)
        if globbed:
            dirs, files, nbytes = Scan(globbed)
            # safePrint is a lock shared by all threads; write_to() and the
            # *Log paths are defined elsewhere:
            with safePrint:
                dirList.extend(dirs)
                write_to(dirLog, 'a', dirs)
                fileList.extend(files)
                write_to(fileLog, 'a', files)
                byteCount += nbytes
                # convert to string for writing:
                write_to(byteLog, 'w', str(byteCount))
        else:
            with safePrint:
                errorList.append(globPattern)
                write_to(errorLog, 'a', globPattern)

    print('Thread exiting')

import queue

dataQueue = queue.Queue()

# globList is the list of glob patterns to scan:
numthreads = 0
for globPattern in globList:
    dataQueue.put(globPattern)
    numthreads += 1

# initialize threads:
threads = []
for i in range(numthreads):
    thread = threading.Thread(target=doScan, args=(dataQueue,))
    threads.append(thread)
    thread.start()

# wait until threads are done:
for thread in threads:
    thread.join()
asked Dec 17 '25 by rumdrums

1 Answer

This sounds like a perfect use case for os.walk, which lets you walk a file system recursively using a simple for loop. The official documentation at https://docs.python.org/3/library/os.html#os.walk contains an example quite similar to your use case.
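For instance, a single-threaded sketch of what Scan() could become if it kept running counts instead of accumulating dirList and fileList (those ever-growing lists are the most likely source of the MemoryError) might look like this; the function and variable names here are mine, not from your code:

import os

def scan(baseDir):
    # walk baseDir recursively, keeping only counters rather than lists
    dirCount = fileCount = byteCount = 0
    for root, dirs, files in os.walk(baseDir):
        dirCount += len(dirs)
        fileCount += len(files)
        for name in files:
            try:
                byteCount += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # broken symlink, permission problem, etc.
    return dirCount, fileCount, byteCount

Because os.walk yields one directory listing at a time, memory use stays roughly proportional to the depth of the tree rather than to the total number of files.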

If you want to do multi-threaded processing, you can start worker threads that consume items from a queue.Queue, which you fill from that for loop.
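Here is a sketch of that producer/consumer pattern, under my own assumptions: NUM_WORKERS, the queue bound, and the '/opt/data' path are all placeholders, and the workers here just total file sizes.

import os
import queue
import threading

NUM_WORKERS = 4   # assumption: tune to your machine
SENTINEL = None   # tells a worker to stop

def worker(q, totals):
    total = 0
    while True:
        path = q.get()
        if path is SENTINEL:
            break
        try:
            total += os.path.getsize(path)
        except OSError:
            pass
    totals.append(total)  # list.append is atomic under the GIL

q = queue.Queue(maxsize=10000)  # bounded, so the walker can't outrun the workers
totals = []
threads = [threading.Thread(target=worker, args=(q, totals)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# producer: fill the queue from the os.walk loop
for root, dirs, files in os.walk('/opt/data'):
    for name in files:
        q.put(os.path.join(root, name))

for _ in threads:
    q.put(SENTINEL)  # one sentinel per worker
for t in threads:
    t.join()

print('total bytes:', sum(totals))

The bounded queue matters for the MemoryError: once 10,000 paths are pending, the producer blocks until a worker catches up, so the backlog can never grow without limit the way dirList and fileList do.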

answered Dec 20 '25 by wouter bolsterlee


