Parsing and extracting information from large HTML files with python and lxml

Tags: python, html, xpath

I would like to parse large HTML files and extract information from them through xpath. To do that, I'm using python and lxml. However, lxml doesn't seem to work well with large files; it parses files correctly only up to around 16 MB. The fragment of code where it tries to extract information from the HTML code through xpath is the following:

tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")

The variable htmlCode contains the HTML code read from a file. I also tried using the parse method to read the code directly from the file instead of passing a string, but it didn't work either. Since the file's contents are read successfully, I think the problem is related to lxml. I've looked for other libraries for parsing HTML with xpath, but lxml seems to be the main library for that.
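For reference, the parse-based variant looks roughly like this (the file name is a placeholder):

import lxml.html

tree = lxml.html.parse('page.html')  # placeholder file name
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")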

Is there another method/function of lxml that deals better with large HTML files?

asked Dec 15 '25 by user12707
1 Answer

If the file is very large, you can use iterparse with the html=True argument to parse the file without any validation. Since iterparse streams the document instead of building the whole tree at once, you need to translate your xpath conditions into manual checks on each element.

from lxml import etree
import unicodedata

# Tag of the elements to extract; adjust this to the element you care about
# (this value comes from the original MediaWiki-dump example).
TAG = '{http://www.mediawiki.org/xml/export-0.8/}text'

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    # modified to call func() only for the events and elements needed
    for event, elem in context:
        if event == 'end' and elem.tag == TAG:
            func(elem, *args, **kwargs)
        # Free the element and any already-processed preceding siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem, fout):
    global counter
    # elem.text may be None for empty elements
    normalized = unicodedata.normalize('NFKD', elem.text or '') \
            .encode('ascii', 'ignore').decode('ascii').lower()
    fout.write(normalized.replace('\n', ' ') + '\n')
    if counter % 10000 == 0:
        print("Doc " + str(counter))
    counter += 1

def main():
    global counter
    counter = 0
    # iterparse expects a binary file object or a file name
    with open("large_file", 'rb') as fin, open('output.txt', 'w') as fout:
        context = etree.iterparse(fin, html=True)
        fast_iter(context, process_element, fout)

if __name__ == "__main__":
    main()

Source
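Applied to the xpath from the question, the same streaming idea looks roughly like this (a minimal sketch; the id check and the relative xpath mirror the original expression, and 'large_file.html' is a placeholder):

from lxml import etree

def extract_items(path):
    results = []
    # Stream the document; 'end' events fire once an element is fully parsed.
    for event, elem in etree.iterparse(path, events=('end',), html=True):
        # Manual condition replacing //*[contains(@id, 'item')]
        elem_id = elem.get('id')
        if elem_id and 'item' in elem_id:
            # The rest of the original expression, relative to the match
            results.extend(elem.xpath('./div/div[2]/p/text()'))
            elem.clear()  # free the processed subtree
    return results

print(extract_items('large_file.html'))

Clearing only the matched subtrees keeps the sketch simple; for strict memory bounds, combine it with the sibling-deletion loop from fast_iter above.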

answered Dec 17 '25 by mudit
