
lxml etree.parse MemoryAllocation Error

Tags:

python

lxml

I'm using lxml's etree.parse to parse a fairly large XML file (around 65MB - 300MB). When I run my standalone Python script containing the function below, I get a memory allocation failure:

Error:

     Memory allocation failed : xmlSAX2Characters, line 5350155, column 16

Partial function code:

def getID():
    try:
        from lxml import etree
        xml = etree.parse(<xml_file>)  # here is where the failure occurs
        for element in xml.iter():
            ...
            result = <formed by concatenating element texts>
        return result
    except Exception as ex:
        <handle exception>

The weird thing is that when I enter the same function in IDLE and test it with the same XML file, I don't get any memory allocation error.

Any ideas on this issue? Thanks in advance.

jaysonpryde asked Jan 19 '26 12:01



1 Answer

I would parse the document with the iterative parser (etree.iterparse) instead, calling .clear() on each element once you are done with it; that way you avoid loading the whole document into memory in one go.

You can limit the iterative parser to only those tags you are interested in. If you only want to parse <person> tags, tell your parser so:

for _, element in etree.iterparse(input, tag='person'):
    # process your person data
    element.clear()

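By clearing the element in the loop, you drop its text, attributes and children as soon as you are done with them, so the parsed content does not pile up in memory.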
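
For a fuller picture, here is a minimal sketch of the same idea that collects the text of every <person> element while streaming. The function name collect_person_text and the sample document are made up for illustration, and the tag check is done inside the loop (rather than with tag='person') only so that the sketch also runs on the standard library's ElementTree if lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library, whose iterparse is
    # similar but does not accept lxml's tag= keyword.
    import xml.etree.ElementTree as etree


def collect_person_text(source, tag="person"):
    """Stream `source` and return the text of each <tag> element, in document order."""
    texts = []
    # An "end" event fires once an element and all of its children are parsed.
    for _, element in etree.iterparse(source, events=("end",)):
        if element.tag != tag:
            continue  # leave other elements alone
        # Gather all text inside this element, including text of its children.
        texts.append("".join(element.itertext()).strip())
        # Drop the element's text, attributes and children now that we are done.
        element.clear()
    return texts


# Hypothetical sample document; a real call would pass a file name instead.
sample = io.BytesIO(
    b"<people>"
    b"<person><name>Ann</name></person>"
    b"<note>skipped</note>"
    b"<person><name>Bob</name></person>"
    b"</people>"
)
names = collect_person_text(sample)
```

Note that the cleared (now empty) elements stay attached to the root, so for very large files you may still see some growth; with lxml you can additionally delete already-processed siblings from their parent if that becomes a problem.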

Martijn Pieters answered Jan 22 '26 01:01
