I have a node like
<a class="someclass">
Wie
<em>Messi</em>
einen kleinen Jungen stehen lässt
</a>
How do I construct an XPath to get ["Wie Messi einen kleinen Jungen stehen lässt"] instead of ["Wie", "Messi", "einen kleinen Jungen stehen lässt"]?
I am using Python's lxml.html module with XPath.
I tried these expressions:
//a/node()/text()
//a/descendant::*/text()
//a/text()
But none of them helped. Any solutions?
I was thinking of another approach where I somehow get the "inner HTML" of the <a> element (which in the above case would be "Wie <em>Messi</em> einen kleinen Jungen stehen lässt") and remove the <em> tags from it.
I am still trying to figure out how to get the inner HTML (like JavaScript's innerHTML) from XPath.
XPath is a selection language, so what it can do is select nodes. If there are separate nodes in the input then you will get a list of separate nodes as the selection result.
You'll need the help of your host language (Python in this case) to do things beyond that scope, such as merging text nodes into a single string.
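To make that concrete, here is a small sketch of what the XPath text selections return for the markup in the question. The `snippet` string and the variable names are my own, and parsing from a string with `lxml.html.fromstring` is an assumption made only to keep the example self-contained.

```python
from lxml import html

# Sample markup from the question, parsed from a string (assumption:
# the real input would come from a file or an HTTP response).
snippet = (
    '<a class="someclass">\n'
    'Wie\n'
    '<em>Messi</em>\n'
    'einen kleinen Jungen stehen lässt\n'
    '</a>'
)
a = html.fromstring(snippet)

# text() selects only the text nodes that are direct children of <a>,
# so the text inside <em> is missing.
direct = [t.strip() for t in a.xpath("text()") if t.strip()]

# .//text() selects every descendant text node, but still as separate nodes.
all_text = [t.strip() for t in a.xpath(".//text()") if t.strip()]

# Merging the nodes into one string is done in Python, not in XPath.
joined = " ".join(all_text)

print(direct)
print(all_text)
print(joined)
```

The selection itself always comes back as a list of nodes; only the final `join` turns it into the single string the question asks for.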
You need to find all <a> elements and join their individual text descendants. That's easy enough to do:
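from lxml import etree
# "path/to/file" is a placeholder for the document to parse.
doc = etree.parse("path/to/file")
for a in doc.xpath("//a"):
    # Join the stripped text of <a> and all of its descendants.
    print(" ".join(t.strip() for t in a.itertext()))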
prints
Wie Messi einen kleinen Jungen stehen lässt
As paul correctly points out in the comments below, you can use XPath's normalize-space() and the whole thing gets even simpler.
for a in doc.xpath("//a"):
    # normalize-space() returns the element's string value with whitespace collapsed.
    print(a.xpath("normalize-space()"))
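For anyone who wants to try this without a file on disk, here is a self-contained sketch of the normalize-space() variant. The wrapping `<div>`, the `snippet` and `tight` strings, and the variable names are hypothetical and only serve the illustration.

```python
from lxml import html

# Hypothetical wrapper around the markup from the question.
snippet = (
    '<div><a class="someclass">\n'
    'Wie\n'
    '<em>Messi</em>\n'
    'einen kleinen Jungen stehen lässt\n'
    '</a></div>'
)
root = html.fromstring(snippet)

# normalize-space() with no argument works on the context node's string
# value: all descendant text concatenated, whitespace runs collapsed.
results = [a.xpath("normalize-space()") for a in root.xpath("//a")]
print(results)

# Caveat: normalize-space() only collapses whitespace that is already
# there. If the markup has none around the inline tag, the words run
# together, while the join-based approach still inserts a space.
tight = html.fromstring("<a>Wie<em>Messi</em></a>")
collapsed = tight.xpath("normalize-space()")
joined = " ".join(t.strip() for t in tight.itertext())
print(collapsed)
print(joined)
```

Which of the two you want depends on the markup: normalize-space() stays faithful to the whitespace in the source, while joining with " " always separates the text nodes.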