
Python HTMLParser

I'm parsing an HTML document using HTMLParser, and I want to print the contents between the start and end of a p tag.

Here is my code snippet:

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            print("TODO: print the contents")
Ruth asked Jan 28 '26 04:01


1 Answer

Based on what @tauran posted, you probably want to do something like this:

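    try:
        from html.parser import HTMLParser  # Python 3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2

    class MyHTMLParser(HTMLParser):
        def print_p_contents(self, html):
            # Start with an empty stack of currently open tags.
            self.tag_stack = []
            self.feed(html)

        def handle_starttag(self, tag, attrs):
            self.tag_stack.append(tag.lower())

        def handle_endtag(self, tag):
            # Guard against a stray end tag arriving with nothing open.
            if self.tag_stack:
                self.tag_stack.pop()

        def handle_data(self, data):
            # Print only text whose innermost open tag is <p>;
            # text outside any tag finds an empty stack and is skipped.
            if self.tag_stack and self.tag_stack[-1] == 'p':
                print(data)

    p = MyHTMLParser()
    p.print_p_contents('<p>test</p>')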

From here, you might want to collect all the <p> contents into a list and return that instead of printing them.
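A minimal sketch of that list-returning variant, assuming Python 3 (where the module is named `html.parser`). The names `ParagraphCollector` and `get_p_contents` are made up for illustration. Instead of a full tag stack, it keeps a buffer that is only active while a <p> is open, so text inside nested inline tags such as <b> still counts as part of the paragraph:

```python
from html.parser import HTMLParser  # Python 3 module name


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element into a list (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._buffer = None   # text pieces of the open <p>, or None outside <p>

    def _flush(self):
        # Store the paragraph read so far, if any, and leave "inside <p>" mode.
        if self._buffer is not None:
            self.paragraphs.append(''.join(self._buffer))
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._flush()      # a new <p> implicitly ends an unclosed one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'p':
            self._flush()

    def handle_data(self, data):
        # Text inside nested inline tags is still part of the paragraph.
        if self._buffer is not None:
            self._buffer.append(data)


def get_p_contents(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


print(get_p_contents('<p>test</p><div>skip</div><p>a <b>bold</b> word</p>'))
# prints ['test', 'a bold word']
```

This sketch does not flush a <p> that is still open when the input ends; whether that matters depends on how well-formed your documents are.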

TIL: when working with libraries like this, you need to think in stacks!
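One catch with the stack approach: void elements such as <br> never get an end tag, so pushing them leaves the stack out of sync with the document. A rough sketch of one way around that, again assuming Python 3. `StackTracker` and `VOID_TAGS` are illustrative names, not part of the library:

```python
from html.parser import HTMLParser  # Python 3 module name

# HTML void elements: they have no end tag, so they are never pushed.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class StackTracker(HTMLParser):
    """Record the open-tag stack seen by each text chunk (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []
        self.seen = []  # (stack snapshot, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.tag_stack:
            while self.tag_stack.pop() != tag:
                pass

    def handle_data(self, data):
        self.seen.append((tuple(self.tag_stack), data))


tracker = StackTracker()
tracker.feed('<div><p>one<br>two <i>three</i></p></div>')
tracker.close()
for stack, text in tracker.seen:
    print(stack, repr(text))
# prints:
# ('div', 'p') 'one'
# ('div', 'p') 'two '
# ('div', 'p', 'i') 'three'
```

With the snapshot in hand, "is this text inside a <p>?" becomes `'p' in stack`, and "is <p> the direct parent?" becomes `stack[-1] == 'p'`.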

Daren Thomas answered Jan 30 '26 17:01