Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python parse XML files with HTML content

Tags:

python

xml

I use an API to get some XML files but some of them contain HTML tags without escaping them. For example, <br> or <b></b>

I use this code to read them, but the files with the HTML raise an error. I don't have access to change manually all the files. Is there any way to parse the file without losing the HTML tags?

from xml.dom.minidom import parse, parseString

xml = ...#here is the api to receive the xml file
dom = parse(xml)
strings = dom.getElementsByTagName("string")
like image 296
Tasos Avatar asked Nov 17 '25 19:11

Tasos


1 Answers

Read the xml file as a string, and fix the malformed tags before you parse it:

import xml.etree.ElementTree as ET

with open(xml) as xml_file: # open the xml file for reading
    text= xml_file.read() # read its contents
text= text.replace('<br>', '<br />') # fix malformed tags
document= ET.fromstring(text) # parse the string
strings= document.findall('string') # find all string elements
like image 55
Aran-Fey Avatar answered Nov 19 '25 10:11

Aran-Fey



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!