I am trying to convert an HTML page into a tree structure. I have derived this class (I removed what I actually do with each tag, as it's not relevant):
import html.parser

class PageParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start "+tag)
    def handle_endtag(self, tag):
        print("end "+tag)
    def handle_startendtag(self, tag, attrs):
        print("startend "+tag)
I expected empty elements to trigger the handle_startendtag method. The problem is that, when encountering an empty element like <meta>, only the handle_starttag method is called. The tag is never closed from my class's point of view:
parser = PageParser()
parser.feed('<div> <meta charset="utf-8"> </div>')
prints:
start div
start meta
end div
I need to know when each element has been closed to correctly create the tree. How can I tell whether a tag is an empty element?
Checking the documentation, and specifically this example:
Parsing an element with a few attributes and a title:
>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
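From this we can conclude that what you are seeing is the expected behavior: an empty element written without a trailing slash, such as <img ...> or <meta ...>, only triggers handle_starttag. handle_startendtag is called only for XHTML-style empty tags such as <img ... />.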
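To check this for yourself, here is a minimal sketch assuming Python 3's standard html.parser module. EventLogger and log_events are hypothetical names introduced only for this illustration; they are not part of the original question or of the library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler fires for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default, which would otherwise call
        # handle_starttag followed by handle_endtag.
        self.events.append(("startend", tag))


def log_events(markup):
    """Feed markup to a fresh EventLogger and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# HTML-style void element, no trailing slash
print(log_events('<meta charset="utf-8">'))
# XHTML-style self-closing tag
print(log_events('<meta charset="utf-8" />'))
```

Comparing the two printed lists shows which handler the parser picks for each spelling of the same element.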
The best suggestion comes from @HenryHeath's comment: use BeautifulSoup.
If you don't want to use it though, you can work around the expected behavior of HTMLParser as follows:
Create a list with the names of the HTML void elements (the elements that never have an end tag):
void_elements = ['area', 'base', ... , 'wbr']
In handle_starttag, check if the tag is in the void_elements list:
class PageParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag in void_elements:
            # Do what should happen inside handle_startendtag()
            print("void element "+tag)
        else:
            print("start "+tag)
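Putting the workaround together with the tree-building goal from the question, a rough sketch might look like the following. Everything here is an assumption made for illustration: TreeBuilder, build_tree, and the dict-based node layout are hypothetical, and the VOID_ELEMENTS set is filled in from the void elements named in the HTML Living Standard rather than taken from the abbreviated list above. Stray or mismatched end tags are simply ignored, which real-world HTML may need handled more carefully.

```python
from html.parser import HTMLParser

# Assumption: void element names as listed in the HTML Living Standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class TreeBuilder(HTMLParser):
    """Hypothetical sketch: builds a nested dict tree of element names."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": None, "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def _add_node(self, tag):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        return node

    def handle_starttag(self, tag, attrs):
        node = self._add_node(tag)
        # Void elements are closed immediately, so they never go on the stack.
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # XHTML-style <tag ... /> is treated as opened and closed at once.
        self._add_node(tag)

    def handle_endtag(self, tag):
        # Only pop when the end tag matches the innermost open element;
        # anything else (including </meta> and similar) is ignored.
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()


def build_tree(markup):
    """Parse markup and return the root node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


tree = build_tree('<div> <meta charset="utf-8"> </div>')
print(tree)
```

With this layout, the meta element ends up as a childless node inside div, and the closing </div> pops div rather than being matched against meta.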
Good luck :)