I am running into memory issues when downloading large data sets from paginated API responses in Python. When I tried to parallelize the download of multiple pages using ThreadPoolExecutor, I noticed that finished and resolved futures do not release their memory.
I reduced the problem to the following two examples. The first downloads all pages using a ThreadPoolExecutor with max_workers set to 1 (as far as I understand, this should have the same memory footprint as a simple loop):
from random import random
from concurrent.futures import ThreadPoolExecutor, as_completed
import gc

TOTAL_PAGES = 60


def download_data(page: int = 1) -> list[float]:
    # Send a request to some resource to get data
    print(f"Downloading page {page}.")
    return [random() for _ in range(1000000)]  # mock some large data sets


def threadpool_memory_test():
    processed_pages = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        future_to_page = {
            executor.submit(download_data, page): page for page in range(1, TOTAL_PAGES + 1)
        }
        for future in as_completed(future_to_page):
            records = future.result()
            # Do something with the downloaded data..
            processed_pages += 1
            print(f"Downloaded page: {processed_pages} / {TOTAL_PAGES} (number: {future_to_page[future]}) with {len(records)} records.")
            gc.collect()  # just to be sure gc is called


if __name__ == "__main__":
    threadpool_memory_test()
However, when I run this script and plot the memory footprint, it shows that the futures do not release their memory even when iterated over with as_completed and their results obtained.
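To see where the memory actually lives, here is a minimal check using only the standard library (the Payload helper class is mine, only needed because plain lists cannot be weak-referenced). It shows that a completed Future keeps an internal reference to its result until the Future object itself is released:

```python
from concurrent.futures import ThreadPoolExecutor
import gc
import weakref


class Payload(list):
    """list subclass: plain lists cannot be weak-referenced."""


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(lambda: Payload([0.0] * 1_000_000))
    data = future.result()
    executor.submit(lambda: None).result()  # barrier: worker thread has dropped its locals
    ref = weakref.ref(data)
    del data                 # our own name for the data is gone...
    gc.collect()
    pinned_while_future_alive = ref() is not None   # ...but the future still holds it
    del future               # drop the last reference to the future itself
    gc.collect()
    freed_after_future_dropped = ref() is None      # now the payload is released

print(pinned_while_future_alive, freed_after_future_dropped)
```

So as long as something (like the future_to_page dict) keeps the futures alive, their results stay in memory too.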
When I download and process the pages in a simple loop, the memory footprint is as expected:
from random import random

TOTAL_PAGES = 60


def download_data(page: int = 1) -> list[float]:
    # Send a request to some resource to get data
    print(f"Downloading page {page}.")
    return [random() for _ in range(1000000)]  # mock some large data sets


def loop_memory_test():
    for page in range(1, TOTAL_PAGES + 1):
        records = download_data(page)
        # Do something with the downloaded data..
        print(f"Downloaded page: {page} / {TOTAL_PAGES} with {len(records)} records.")


if __name__ == "__main__":
    loop_memory_test()
The memory footprint of this script:

Is there a way to release the memory of a future once its result has been obtained?
I am testing this on macOS Monterey version 12.5 (21G72).
Based on SIGHUP's comment, I updated the script and it now works as expected (it is also about 10x faster and uses a fraction of the memory):
from random import random
from concurrent.futures import ThreadPoolExecutor, as_completed
import gc

TOTAL_PAGES = 60


def download_data(page: int = 1) -> list[float]:
    # Send a request to some resource to get data
    print(f"Downloading page {page}.")
    return [random() for _ in range(1000000)]  # mock some large data sets


def threadpool_memory_test():
    processed_pages = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        future_to_page = {
            executor.submit(download_data, page): page for page in range(1, TOTAL_PAGES + 1)
        }
        for future in as_completed(future_to_page):
            records = future.result()
            page = future_to_page.pop(future)  # drop the reference to the finished future
            # Do something with the downloaded data..
            processed_pages += 1
            print(f"Downloaded page: {processed_pages} / {TOTAL_PAGES} (number: {page}) with {len(records)} records.")
            gc.collect()  # just to be sure gc is called


if __name__ == "__main__":
    threadpool_memory_test()
It boils down to this single line:
page = future_to_page.pop(future)
which removes the dictionary's reference to the finished future. A completed future keeps an internal reference to its result, so as long as future_to_page holds every future, every downloaded page stays in memory; popping the future lets it, and the result it stores, be garbage collected as soon as the loop moves on.
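The updated script still submits every page up front, so each result also sits inside its future until the loop consumes it. If peak memory matters, one further option is to bound the number of in-flight pages with wait(..., FIRST_COMPLETED). This is only a sketch; MAX_IN_FLIGHT and the smaller mock payload are my own choices, not from the original script:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice
from random import random

TOTAL_PAGES = 60
MAX_IN_FLIGHT = 4  # at most this many page results are held in memory at once


def download_data(page: int = 1) -> list[float]:
    # mock payload; a real implementation would hit the API here
    return [random() for _ in range(1000)]


def bounded_download() -> int:
    processed = 0
    pages = iter(range(1, TOTAL_PAGES + 1))
    with ThreadPoolExecutor(max_workers=4) as executor:
        # prime a small window of submissions instead of submitting everything
        in_flight = {
            executor.submit(download_data, p): p
            for p in islice(pages, MAX_IN_FLIGHT)
        }
        while in_flight:
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                page = in_flight.pop(future)  # drop the reference so the future can be freed
                records = future.result()
                # Do something with the downloaded data..
                processed += 1
            # refill the window with fresh pages
            for p in islice(pages, MAX_IN_FLIGHT - len(in_flight)):
                in_flight[executor.submit(download_data, p)] = p
    return processed


if __name__ == "__main__":
    print(bounded_download())
```

With this pattern, at most MAX_IN_FLIGHT futures (and their results) exist at any moment, regardless of the total page count.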
The memory footprint now:

Thank you!