Python list-comprehension too slow

I have 231 PDF files and would like to convert each of them to a string. I will then save each of these strings to a txt file.

I was able to write code for this (I checked that it works when I ran it on a smaller number of files), but Python did not finish executing the program even after 10 hours!

I tried the same logic with a plain for loop, but it is just as slow. Any idea how I could make this code faster?

Here is my code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

from io import StringIO

def pdf_to_text(pdfname):

    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Extract text
    fp = open(pdfname, 'rb')
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()

    # Get text from StringIO
    text = sio.getvalue()

    # Cleanup
    device.close()
    sio.close()

    return text

lista2 = [pdf_to_text(k) for k in lista1]

where lista1 is the list of paths to my 231 PDFs.

The PDF files were extracted from this website. I selected only the files with the word "Livro" in the name.

Lucas asked Mar 20 '26 21:03

1 Answer

This is one of the great use cases for generators: conserving memory.

Often, all you need to do is iterate over the files, transforming one at a time and streaming the output somewhere else. Say, for example:

for f in files:
    text = pdf_to_text(f)
    output.write(text)

-- then you don't want (or need) a list comprehension; in fact, you never need to create a list at all. Instead, consider iterating over the elements one at a time, or create a generator if that makes more sense.
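For the asker's specific goal (one txt file per PDF), the streaming loop above might look like the sketch below. `pdf_to_text` is the function from the question; `convert` here is a hypothetical stand-in so the pattern is self-contained and runnable without pdfminer:

```python
from pathlib import Path

def convert(path):
    # Stand-in for pdf_to_text, purely to illustrate the pattern.
    return f"text of {path}"

def pdf_texts(paths, transform):
    # Generator: only one document's text is alive at a time.
    for p in paths:
        yield p, transform(p)

def save_all(paths, out_dir, transform):
    # Stream each result straight to disk instead of collecting a list.
    for p, text in pdf_texts(paths, transform):
        out = Path(out_dir) / (Path(p).stem + ".txt")
        out.write_text(text, encoding="utf-8")
```

With the real `pdf_to_text` swapped in for `convert`, each PDF's text is written and released before the next one is processed, so peak memory stays at roughly one document.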

Keep in mind that the garbage collector cannot release memory while you still hold a reference to it. If you build a list comprehension, all of its elements (and anything those elements reference) must be kept in memory at the same time. Usually you only need that if you plan to access the elements repeatedly or in a non-linear order.
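You can see the difference directly: a generator expression stays tiny regardless of how many results it will eventually produce, because nothing is materialized until you ask for it.

```python
import sys

eager = [x * x for x in range(100_000)]  # all 100,000 results held at once
lazy = (x * x for x in range(100_000))   # results produced on demand

# The generator object is a small fixed-size wrapper, not a container.
print(sys.getsizeof(lazy) < sys.getsizeof(eager))
```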

You should also consider that even an allocate/transform/deallocate approach may still be "too slow" if you are reading and writing many gigabytes of data. In that case the best alternative is often a C extension, which gives you better control over how memory is allocated and used. Also, PyPy works in the vast majority of cases and is usually much faster than CPython.

Brian Cain answered Mar 23 '26 11:03