
Python/lxml is eating too much memory

The program is quite simple: it recursively descends into directories and extracts one element from each file. There are about 1,000 directories, each with around 200 files of roughly 0.5 MB. After a while the script consumes about 2.5 GB of memory, which is completely unacceptable; the script isn't the only process running. I can't understand why the memory isn't released. An explicit del doesn't help. Are there any techniques I should consider?


from lxml import etree
import os

# parser and basedir are defined earlier (omitted here)
res = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        for href in tree.xpath("//a[@class='ctitle']/@href"):
            res.add(href)
        del tree  # explicit del doesn't help
aikipooh, asked Oct 28 '25 10:10


1 Answer

You're keeping references to elements from the tree. Each XPath result here is an _ElementUnicodeResult, a "smart string" that keeps a reference to its parent element. That reference prevents the whole tree from being garbage collected, so every parsed file stays in memory.
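A minimal sketch of what's going on, using an in-memory document in place of one of your files (the sample HTML and its href are made up for illustration):

```python
from lxml import etree
import io

# Stand-in for one of the parsed files.
html = b"<html><body><a class='ctitle' href='/post/1'>One</a></body></html>"
tree = etree.parse(io.BytesIO(html), etree.HTMLParser())

href = tree.xpath("//a[@class='ctitle']/@href")[0]
print(type(href).__name__)   # _ElementUnicodeResult, not plain str
print(href.getparent().tag)  # 'a' -- the smart string still points back
                             # into the tree, keeping the whole tree alive
```

As long as such a result sits in your set, the tree it came from cannot be collected.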

Try converting each result to a plain string and storing that instead:

from lxml import etree
import os

# parser and basedir as in the question
titles = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        for title in tree.xpath("//a[@class='ctitle']/@href"):
            # str() copies the text and drops the parent reference,
            # so the tree can be garbage collected after each file
            titles.add(str(title))
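Alternatively, lxml can be told not to produce smart strings at all: a compiled etree.XPath with smart_strings=False returns plain str objects with no back-reference to the tree. A minimal sketch, again using a made-up in-memory document:

```python
from lxml import etree
import io

html = b"<html><body><a class='ctitle' href='/post/1'>One</a></body></html>"
tree = etree.parse(io.BytesIO(html), etree.HTMLParser())

# Compile the expression once; smart_strings=False yields plain strings
# that hold no reference back to the tree.
find_hrefs = etree.XPath("//a[@class='ctitle']/@href", smart_strings=False)
hrefs = find_hrefs(tree)
print(hrefs)  # ['/post/1']
```

Compiling the expression once outside the loop also avoids re-parsing the XPath string for every file.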
Peter Wood, answered Oct 30 '25 01:10


