I got a strange bug with lxml:
>>> import lxml.html
>>> s = '<html><head><noscript></noscript><script></script><meta></head></html>'
>>> root = lxml.html.fromstring(s)
>>> root.xpath('/html/head/meta')
[]
>>> root.xpath('/html/body/meta')
[<Element meta at 0x2a92788>]
The meta tag should be in the head element, not the body. How can I get the correct element in this situation?
Let me guess: are you using an old version of Ubuntu (like 12.04)?
Actually, it's a bug in the old version of the libxml2 library that comes preinstalled and is used by the lxml package. The release notes for libxml2 2.8.0 mention a fix for an HTML parser error with <noscript> in the <head>, so I guess any libxml2 version >= 2.8.0 should work. Ubuntu 12.04 has version 2.7.8 installed.
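If upgrading libxml2 is not an option, one possible workaround is to query for the element without depending on whether the parser put it under head or body. The sketch below uses the XPath `//meta` for that; the helper name `find_meta` is made up for illustration and is not part of lxml.

```python
import lxml.html

def find_meta(html_text):
    """Return every <meta> element, wherever the parser placed it."""
    root = lxml.html.fromstring(html_text)
    # '//meta' matches at any depth, so it finds the element under
    # either <head> or <body>.
    return root.xpath('//meta')

s = '<html><head><noscript></noscript><script></script><meta></head></html>'
for element in find_meta(s):
    # Show which parent the parser actually chose for this element.
    print(element.tag, 'is inside', element.getparent().tag)
```

This only hides the symptom: the tree is still built incorrectly by the old parser, so other elements following <noscript> may be misplaced as well.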
>>> import lxml.etree
>>> lxml.etree.LIBXML_COMPILED_VERSION
(2, 7, 8)
>>> lxml.etree.LIBXML_VERSION
(2, 9, 1)
I think if either of these versions is >= 2.8.0, the <noscript> issue should be gone.
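The check above can be scripted. This is a minimal sketch that assumes 2.8.0 is the first fixed release, as suggested by the release notes mentioned earlier; `MIN_FIXED` and `has_noscript_fix` are hypothetical names, not lxml APIs.

```python
# Assumption: libxml2 2.8.0 is the first release with the <noscript> fix.
MIN_FIXED = (2, 8, 0)

def has_noscript_fix(version, minimum=MIN_FIXED):
    """Return True if a (major, minor, patch) tuple is at least `minimum`."""
    return tuple(version) >= tuple(minimum)

try:
    import lxml.etree
except ImportError:
    print('lxml is not installed')
else:
    # Report both the version lxml was built against and the one loaded now.
    for name in ('LIBXML_COMPILED_VERSION', 'LIBXML_VERSION'):
        version = getattr(lxml.etree, name)
        status = 'ok' if has_noscript_fix(version) else 'older than 2.8.0'
        print(name, version, status)
```

Python compares tuples element by element, so (2, 7, 8) sorts below (2, 8, 0) without any string parsing.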
This works for me:
import lxml.html
s = '<html><head><noscript></noscript><script></script><meta></head></html>'
root = lxml.html.fromstring(s)
print(root.xpath('/html/head/meta'))
print(root.xpath('/html/body/meta'))
Output:
[<Element meta at 0x10a123b8>]
[]
I'm using Python 2.7.9 and lxml version 3.4.2.