I am running into memory issues when downloading large data sets from paginated API responses in Python. When I tried to parallelize the download of multiple pages using ThreadPoolExecutor, I noticed that finished and resolved futures do not release their memory.
I reduced the problem to the following two examples. The first downloads all pages using a ThreadPoolExecutor with max_workers set to 1 (as far as I understand, this should have the same memory footprint as a simple loop):
from random import random
from concurrent.futures import ThreadPoolExecutor, as_completed
import gc

TOTAL_PAGES = 60


def download_data(page: int = 1) -> list[float]:
    # Send a request to some resource to get data
    print(f"Downloading page {page}.")
    return [random() for _ in range(1000000)]  # mock some large data sets


def threadpool_memory_test():
    processed_pages = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        future_to_page = {
            executor.submit(download_data, page): page for page in range(1, TOTAL_PAGES + 1)
        }
        for future in as_completed(future_to_page):
            records = future.result()
            # Do something with the downloaded data..
            processed_pages += 1
            print(f"Downloaded page: {processed_pages} / {TOTAL_PAGES} (number: {future_to_page[future]}) with {len(records)} records.")
            gc.collect()  # just to be sure gc is called


if __name__ == "__main__":
    threadpool_memory_test()
However, when I run this script and plot the memory footprint, it shows that the futures do not release their memory even when iterated over with as_completed and their results obtained.
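To see where the memory actually lives, here is a minimal check using only the standard library (the Payload helper class is mine, only needed because plain lists cannot be weak-referenced). It shows that a completed Future keeps an internal reference to its result until the Future object itself is released:

```python
from concurrent.futures import ThreadPoolExecutor
import gc
import weakref


class Payload(list):
    """list subclass: plain lists cannot be weak-referenced."""


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(lambda: Payload([0.0] * 1_000_000))
    data = future.result()
    executor.submit(lambda: None).result()  # barrier: worker thread has dropped its locals
    ref = weakref.ref(data)
    del data                 # our own name for the data is gone...
    gc.collect()
    pinned_while_future_alive = ref() is not None   # ...but the future still holds it
    del future               # drop the last reference to the future itself
    gc.collect()
    freed_after_future_dropped = ref() is None      # now the payload is released

print(pinned_while_future_alive, freed_after_future_dropped)
```

So as long as something (like the future_to_page dict) keeps the futures alive, their results stay in memory too.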
When I download and process the pages in a simple loop, the memory footprint is as expected:
from random import random

TOTAL_PAGES = 60


def download_data(page: int = 1) -> list[float]:
    # Send a request to some resource to get data
    print(f"Downloading page {page}.")
    return [random() for _ in range(1000000)]  # mock some large data sets


def loop_memory_test():
    for page in range(1, TOTAL_PAGES + 1):
        records = download_data(page)
        # Do something with the downloaded data..
        print(f"Downloaded page: {page} / {TOTAL_PAGES} with {len(records)} records.")


if __name__ == "__main__":
    loop_memory_test()
The memory footprint of this script:

Is there a way to release the memory of a future once its result has been obtained?
I am testing this on macOS Monterey version 12.5 (21G72).
Based on SIGHUP's comment, I updated the script and it now works as expected (it is also about 10x faster and uses a fraction of the memory):
from random import random
from concurrent.futures import ThreadPoolExecutor, as_completed
import gc

TOTAL_PAGES = 60


def download_data(page: int = 1) -> list[float]:
    # Send a request to some resource to get data
    print(f"Downloading page {page}.")
    return [random() for _ in range(1000000)]  # mock some large data sets


def threadpool_memory_test():
    processed_pages = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        future_to_page = {
            executor.submit(download_data, page): page for page in range(1, TOTAL_PAGES + 1)
        }
        for future in as_completed(future_to_page):
            records = future.result()
            page = future_to_page.pop(future)  # drop the reference to the finished future
            # Do something with the downloaded data..
            processed_pages += 1
            print(f"Downloaded page: {processed_pages} / {TOTAL_PAGES} (number: {page}) with {len(records)} records.")
            gc.collect()  # just to be sure gc is called


if __name__ == "__main__":
    threadpool_memory_test()
It boils down to this single line:
page = future_to_page.pop(future)
which removes the dictionary's reference to the finished future. A completed future keeps an internal reference to its result, so as long as future_to_page holds every future, every downloaded page stays in memory; popping the future lets it, and the result it stores, be garbage collected as soon as the loop moves on.
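The updated script still submits every page up front, so each result also sits inside its future until the loop consumes it. If peak memory matters, one further option is to bound the number of in-flight pages with wait(..., FIRST_COMPLETED). This is only a sketch; MAX_IN_FLIGHT and the smaller mock payload are my own choices, not from the original script:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice
from random import random

TOTAL_PAGES = 60
MAX_IN_FLIGHT = 4  # at most this many page results are held in memory at once


def download_data(page: int = 1) -> list[float]:
    # mock payload; a real implementation would hit the API here
    return [random() for _ in range(1000)]


def bounded_download() -> int:
    processed = 0
    pages = iter(range(1, TOTAL_PAGES + 1))
    with ThreadPoolExecutor(max_workers=4) as executor:
        # prime a small window of submissions instead of submitting everything
        in_flight = {
            executor.submit(download_data, p): p
            for p in islice(pages, MAX_IN_FLIGHT)
        }
        while in_flight:
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                page = in_flight.pop(future)  # drop the reference so the future can be freed
                records = future.result()
                # Do something with the downloaded data..
                processed += 1
            # refill the window with fresh pages
            for p in islice(pages, MAX_IN_FLIGHT - len(in_flight)):
                in_flight[executor.submit(download_data, p)] = p
    return processed


if __name__ == "__main__":
    print(bounded_download())
```

With this pattern, at most MAX_IN_FLIGHT futures (and their results) exist at any moment, regardless of the total page count.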
The memory footprint now:

Thank you!