 

Python lxml error "namespace not defined."

I am being driven crazy by some oddly formed XML and would be grateful for some pointers.

The documents are defined like this:

<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>

Now I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.

Sample code:

from lxml import objectify, etree
import gzip

with open('file_list', 'rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)

The first elem.text value is returned, but only for the first file in the list, and it is quickly followed by this error:

lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20

Please excuse my ignorance, but XML really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other, more intelligent manner?

Thanks

RJJ asked Jan 23 '26 17:01


1 Answer

Your input file is not well-formed XML: the sphinx prefix is used without a matching xmlns:sphinx declaration. I assume that it is a snippet from a larger XML document.

Your choices are:

  • Reconstruct the larger document. How you do this is specific to your application. You may have to consult the people who created the files you are parsing.

  • Parse the file in spite of its errors. To do that, pass the recover keyword to lxml.etree.iterparse:

    xml2 = etree.iterparse(in_xml, recover=True)
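
For the first option, one possible way to reconstruct a well-formed document is to wrap each file's contents in a synthetic root element that declares the sphinx prefix. The sketch below is only an illustration under several assumptions: it uses Python 3 and the standard library's xml.etree.ElementTree rather than lxml, the namespace URI is a made-up placeholder (the source does not say what sphinx should map to), the files contain no XML declaration of their own, and the names iter_page_numbers and print_page_numbers are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# Placeholder URI (an assumption): any URI works, it only has to be declared.
SPHINX_NS = "urn:example:sphinx"
DOC_TAG = "{%s}document" % SPHINX_NS


def iter_page_numbers(stream, chunk_size=65536):
    """Yield the text of every <page_number> element read from a binary stream.

    The stream is fed to the parser between a synthetic opening and closing
    root tag, so the sphinx prefix is declared and several
    <sphinx:document> snippets in one file still form a single document.
    """
    parser = ET.XMLPullParser(events=("end",))

    def drain():
        # Hand out whatever the parser has finished so far.
        for _event, elem in parser.read_events():
            if elem.tag == "page_number":
                yield elem.text
            elif elem.tag == DOC_TAG:
                # Drop the children of a finished document to limit memory use.
                elem.clear()

    parser.feed(b'<wrapper xmlns:sphinx="' + SPHINX_NS.encode("ascii") + b'">')
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        parser.feed(chunk)
        yield from drain()
    parser.feed(b"</wrapper>")
    parser.close()
    # Events still pending after close() can be read as well.
    yield from drain()


def print_page_numbers(list_path="file_list"):
    """Print each page number next to the gzip file it came from."""
    with open(list_path) as names:
        for line in names:
            path = line.strip()
            if not path:
                continue
            with gzip.open(path, "rb") as in_xml:
                for page in iter_page_numbers(in_xml):
                    print(page, path)
```

Calling print_page_numbers("file_list") would mirror the loop in the question. If the real files start with an XML declaration or are not UTF-8, the wrapping step would need to be adjusted.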
    
Robᵩ answered Jan 25 '26 07:01


