I am using Python 2.7 with the latest lxml library. I am parsing a large XML file with a very homogeneous structure and millions of elements. I thought lxml's iterparse would not build an internal tree while it parses, but apparently it does, since memory usage grows until the process crashes (at around 1 GB). Is there a way to parse a large XML file with lxml without using a lot of memory?
I saw the target parser interface as one possibility, but I'm not sure whether it will work any better. For reference, this is the kind of thing I mean; a minimal sketch, where the RecordTarget class, the 'record' tag, and 'huge.xml' are all made up for illustration:
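import lxml.etree as ET

class RecordTarget(object):
    # Hypothetical target: no tree is built, only a counter is kept.
    def __init__(self):
        self.count = 0
    def start(self, tag, attrib):
        pass
    def end(self, tag):
        if tag == 'record':  # hypothetical element name
            self.count += 1
    def data(self, data):
        pass
    def close(self):
        return self.count

parser = ET.XMLParser(target=RecordTarget())
# With a target parser, parse() returns whatever close() returns.
result = ET.parse('huge.xml', parser)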
Try using Liza Daly's fast_iter:
def fast_iter(context, func, args=[], kwargs={}):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Clear the element's children and text to release memory.
        elem.clear()
        # Also delete preceding siblings that have already been
        # processed, so the root element does not keep them alive.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
fast_iter clears each element from the tree once it has been processed, and also removes preceding siblings (possibly with other tags) that are no longer needed, so the tree never accumulates in memory.
It could be used like this:
import lxml.etree as ET

def process_element(elem):
    ...

context = ET.iterparse(filename, events=('end',), tag=...)
fast_iter(context, process_element)
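As a complete runnable example, here is a sketch that counts elements in a large file; the file name 'huge.xml' and the 'record' tag are assumptions for illustration:

import lxml.etree as ET

counts = {'n': 0}

def process_element(elem):
    # Hypothetical processing: just count the elements.
    counts['n'] += 1

context = ET.iterparse('huge.xml', events=('end',), tag='record')
fast_iter(context, process_element)
print counts['n']

Because fast_iter clears each element and its already-processed siblings as it goes, memory usage stays roughly constant regardless of file size, as long as process_element does not keep references to the elements it receives.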