Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:

<?FM MARKER [Index] foo, bar ?>

to this:

<indexterm>
    <primary>foo, bar</primary>
</indexterm>

I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:

  • ElementTree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.

  • ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.

  • xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.

Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?

Pantagrool asked Jan 29 '26 17:01


1 Answer

Use the XPath API of the lxml module, which can select processing instructions anywhere in the tree:

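from io import StringIO
from lxml import etree

# Sample document with a processing instruction nested inside an element
foo = StringIO('<foo><bar><?FM MARKER [Index] foo, bar ?></bar></foo>')
tree = etree.parse(foo)
# Select every processing instruction, at any depth
result = tree.xpath('//processing-instruction()')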

The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
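
As a rough sketch of how the Literal form could be applied to the question's markers: the snippet below selects only processing instructions whose target is FM, looks up each one's parent with getparent(), and swaps it for an indexterm element. The sample document, the MARKER_RE pattern, and the convert_index_markers helper are assumptions made for illustration, not part of the original answer; real FrameMaker exports may need a different pattern.

```python
import re
from io import BytesIO
from lxml import etree

# Hypothetical sample input; the real documents will look different.
SAMPLE = b"""<doc>
  <para>Some text<?FM MARKER [Index] foo, bar ?> and more text.</para>
  <section><title>Deep<?FM MARKER [Index] baz ?></title></section>
  <?other keep me?>
</doc>"""

# Assumed shape of the marker payload, based on the example in the question.
MARKER_RE = re.compile(r"MARKER \[Index\]\s*(.*?)\s*$", re.DOTALL)


def convert_index_markers(tree):
    """Replace each <?FM MARKER [Index] ...?> with an <indexterm> element.

    Returns a list of (parent tag, index text) pairs for the converted markers.
    """
    converted = []
    # The Literal argument restricts the match to instructions named "FM".
    for pi in tree.xpath('//processing-instruction("FM")'):
        match = MARKER_RE.match(pi.text or "")
        if match is None:
            continue  # an FM instruction that is not an index marker
        parent = pi.getparent()
        if parent is None:
            continue  # instruction sits outside the root element
        indexterm = etree.Element("indexterm")
        etree.SubElement(indexterm, "primary").text = match.group(1)
        indexterm.tail = pi.tail  # keep the text that followed the instruction
        parent.replace(pi, indexterm)
        converted.append((parent.tag, match.group(1)))
    return converted


tree = etree.parse(BytesIO(SAMPLE))
converted = convert_index_markers(tree)
output = etree.tostring(tree, encoding="unicode")
print(output)
```

Copying the tail onto the new element before calling replace() matters here, because in lxml the text that follows a node belongs to that node and would otherwise be dropped along with the processing instruction.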

References

  • XPath and XSLT with lxml
  • XML Path Language 1.0: Node Tests

Paul Sweatte answered Feb 01 '26 07:02


