
Java + XML: do the libraries handle the encoding from the <?xml ?> header?

Tags:

java

xml

encoding

I'm so used to using <?xml version="1.0" encoding="UTF-8"?> that it didn't occur to me until now that there might be some subtleties with other encodings when using the standard Java XML libraries (SAX, DOM, StAX)...

Do these libraries automatically handle the encoding attribute in the header when reading XML documents? If so, where is this documented? (It's not in DocumentBuilder or DocumentBuilderFactory.) If not, what do I have to do to make it work for different encodings?

Jason S asked Nov 18 '25 01:11



1 Answer

DocumentBuilder uses the SAX API to provide the document to the implementation for parsing (though the implementation might not actually use a SAX parser), and the Javadoc for SAX's org.xml.sax.InputSource says what it does with the header:

The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
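The difference between the byte-stream and character-stream cases can be sketched as follows. This is a minimal illustration, not part of the Javadoc: the class name EncodingDemo, its method names, and the sample document are made up for the example, and it assumes the JDK's default DocumentBuilder implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
public class EncodingDemo {

    // A sample document that declares ISO-8859-1 and really is encoded that way.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00e9</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte stream, no encoding set on the InputSource:
    // the parser autodetects the encoding, here from the XML declaration.
    static String parseFromByteStream(byte[] bytes) throws Exception {
        return parse(new InputSource(new ByteArrayInputStream(bytes)));
    }

    // Character stream: the caller has already chosen the charset,
    // and the parser disregards the encoding declaration.
    static String parseFromCharacterStream(byte[] bytes, Charset charset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset);
        return parse(new InputSource(reader));
    }

    private static String parse(InputSource source) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(source);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleBytes();
        System.out.println(parseFromByteStream(bytes));
        System.out.println(parseFromCharacterStream(bytes, StandardCharsets.ISO_8859_1));
        // Wrong charset on the Reader: the declaration cannot rescue the text.
        System.out.println(parseFromCharacterStream(bytes, StandardCharsets.UTF_8));
    }
}
```

In other words, handing the parser an InputStream lets it honor the declaration, while wrapping the stream in a Reader makes the caller responsible for picking the right charset.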

So an interesting case is an XML stream supplied via HTTP, where the charset in the HTTP Content-Type header conflicts with the XML's encoding declaration.
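One way such a conflict can arise is when the caller copies the HTTP charset onto the InputSource, since (per the Javadoc quoted above) an encoding specified there is used instead of autodetection. The sketch below is only an assumption about how a caller might wire this up: HttpXmlSource, charsetFrom and inputSourceFor are hypothetical names, and the Content-Type parsing is deliberately simplistic.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
public class HttpXmlSource {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/xml; charset=ISO-8859-1". Returns null if there is none.
    static String charsetFrom(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Builds an InputSource over the response body. If the header names a
    // charset, it is set on the InputSource; otherwise the encoding is left
    // unset so the parser can autodetect it.
    static InputSource inputSourceFor(InputStream body, String contentType) {
        InputSource source = new InputSource(body);
        String charset = charsetFrom(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        }
        return source;
    }

    public static void main(String[] args) {
        InputStream body = new ByteArrayInputStream(new byte[0]);
        System.out.println(inputSourceFor(body, "text/xml; charset=ISO-8859-1").getEncoding());
        System.out.println(inputSourceFor(body, "application/xml").getEncoding());
    }
}
```

Whether the HTTP header or the XML declaration should win is a policy decision for the caller; leaving the encoding unset defers to the parser's autodetection.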

artbristol answered Nov 19 '25 16:11



