I have a 25GB file I need to process. Here is what I'm currently doing, but it takes an extremely long time to open:
collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    collection_contents = f.readlines()
length_of_file = len(collection_contents)
for num, line in enumerate(collection_contents):
    print '%s / %s' % (num+1, length_of_file)
    cursor.execute(...)
How could I improve this?
Unless the lines in your file are really, really big, do not print the progress at every line. Printing to a terminal is very slow. Print progress e.g. every 100 or every 1000 lines.
Use the available operating system facilities to get the size of the file, os.path.getsize(), instead of loading every line just to count them; see "Getting file size in Python?"
Get rid of readlines() to avoid reading 25GB into memory. Instead read and process line by line, see e.g. How to read large file, line by line in python
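Putting the three suggestions together, a minimal sketch could look like the following. It is written in Python 3 syntax (the question uses the Python 2 print statement), and the names process_large_file, handle_line, report_every and out are made up for this illustration. It also assumes the file is UTF-8 text, and it reports progress in bytes rather than lines, since os.path.getsize() gives the size in bytes without reading the file.

```python
import os
import sys


def process_large_file(path, handle_line, report_every=1000, out=sys.stderr):
    """Stream `path` line by line, calling handle_line(line) for each line.

    Progress is written to `out` once every `report_every` lines.
    Returns the number of lines processed.
    """
    # File size in bytes, obtained from the OS without reading the file.
    total_bytes = os.path.getsize(path)
    bytes_read = 0
    count = 0
    # Binary mode so that len(raw) matches the on-disk byte count.
    with open(path, 'rb') as f:
        # Iterating over the file object yields one line at a time,
        # so only the current line is held in memory.
        for count, raw in enumerate(f, 1):
            bytes_read += len(raw)
            # Assumption: the file is UTF-8 encoded text.
            handle_line(raw.decode('utf-8'))
            if count % report_every == 0:
                out.write('%d lines, %d / %d bytes\n'
                          % (count, bytes_read, total_bytes))
    return count
```

In the question's setting, handle_line would be a small function wrapping the existing cursor.execute(...) call, and path would be collection_pricing. If the database work turns out to be the slow part after these changes, that is a separate issue from reading the file.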