Memory issues with splitting lines in huge files in Python

Tags:

python

I'm trying to read from disk a huge file (~2GB) and split each line into multiple strings:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines

Problem is, it tries to allocate tens and tens of GB in memory. I found out that it doesn't happen if I change my code in the following way:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]    # no splitting
    return split_lines

I.e., if I do not split the lines, memory usage drastically goes down. Is there any way to handle this problem, maybe some smart way to store split lines without filling up the main memory?

Thank you for your time.

asked by zhed

1 Answer

After the split, each line becomes multiple objects: a list plus a separate string object for every field. Each of those objects carries its own per-object overhead on top of the characters that made up the original line, which is why memory usage balloons.
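
As a rough illustration of that overhead (a minimal sketch; the exact byte counts depend on your Python build), sys.getsizeof shows that a line kept as one string is noticeably smaller than the same line kept as a list of its fields:

import sys

line = "alpha beta gamma delta epsilon"
fields = line.split()

# Size of the single string object holding the whole line.
print(sys.getsizeof(line))

# Size of the list object itself plus each individual field string.
print(sys.getsizeof(fields) + sum(sys.getsizeof(s) for s in fields))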

Rather than reading the entire file into memory, use a generator.

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    pass    # do something with the list t

This does not preclude you from writing something like

lines = list(get_split_lines(file_path))

if you really need to read the entire file into memory.
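
For illustration, here is a hypothetical consumer that uses the generator lazily and keeps only a running summary (the word-count logic is just an assumption to make the example concrete), so the whole split file never has to sit in memory at once:

def count_words(file_path):
    total = 0
    for fields in get_split_lines(file_path):
        total += len(fields)    # keep only a running total, not the fields themselves
    return total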

answered by chepner


