 

Remove a root tag from xml/html using tostring() of lxml

Tags: python, cdata, lxml

How can I serialize HTML text without a root tag (usually <html></html>)? For example, for use in CDATA:

<![CDATA[<div class="foo"></div><p>bar</p>]]>

My code:

from lxml import etree

html = etree.Element('root')
etree.SubElement(html, 'div', attrib={'class':'foo'})
etree.SubElement(html, 'p').text='bar'

t = etree.tostring(html)
# '<root><div class="foo"/><p>bar</p></root>'

I would rather not use a regex to remove the root tag.

bl79 asked Dec 12 '25 02:12


1 Answer

If you need the text representation of all subelements without the root element, you can do:

subels = ''.join([etree.tostring(el).decode('ascii') for el in html])

where html is the Element from your question. In this case, subels is the string:

'<div class="foo"/><p>bar</p>'

This can be further improved to get only specific tags using the iter method. For example:

subels = ''.join([etree.tostring(el).decode('ascii') for el in html.iter('div', 'p')])

will return only the 'div' and 'p' tags, so if there had been other tags they would have been omitted.
You can use it to filter out unwanted tags, but be careful because it may break the document hierarchy: iter walks all descendants, so it still returns tags nested inside undesired tags.

EDIT after comments

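If the root tag has a text attribute which you want to keep, just add it back. Note that html.text is None when the root has no leading text (as in the question), so fall back to an empty string:

subels = ''.join([html.text or ''] + [etree.tostring(el).decode('ascii') for el in html])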

Valentino answered Dec 13 '25 16:12


