What does "Java Modified UTF-8 Encoding" mean? How does it differ from standard UTF-8 encoding?
This is described in detail in the javadoc of DataInput:
Modified UTF-8
Implementations of the `DataInput` and `DataOutput` interfaces represent Unicode strings in a format that is a slight modification of UTF-8. (For information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0). Note that in the following tables, the most significant bit appears in the far left-hand column.
... (some tables, please click the javadoc link to see yourself) ...
The differences between this format and the standard UTF-8 format are the following:
- The null byte `'\u0000'` is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
- Only the 1-byte, 2-byte, and 3-byte formats are used.
- Supplementary characters are represented in the form of surrogate pairs.
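To make these differences concrete, here is a small sketch that encodes the same strings both ways and prints the bytes in hex. The class name `ModifiedUtf8Demo` and the helpers `modifiedUtf8` and `hex` are made up for illustration; only `DataOutputStream.writeUTF` and `String.getBytes` come from the JDK. The sketch strips the 2-byte length prefix that `writeUTF` emits so that only the character bytes are compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    // Hypothetical helper: encode with writeUTF, then drop the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\u0000";
        String supplementary = new String(Character.toChars(0x1F600)); // one code point, two chars

        // The null character: 1 byte in standard UTF-8, 2 bytes in modified UTF-8.
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8))); // 00
        System.out.println(hex(modifiedUtf8(nul)));                    // C0 80

        // A supplementary character: one 4-byte group in standard UTF-8,
        // two 3-byte groups (one per surrogate) in modified UTF-8.
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary)));                    // ED A0 BD ED B8 80
    }
}
```

For characters between U+0001 and U+FFFF that are not surrogates, the two encodings should produce identical bytes, which is why the difference often goes unnoticed.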
How to read it is described in detail in the javadoc of DataInput#readUTF():
readUTF
`String readUTF() throws IOException`
Reads in a string that has been encoded using a modified UTF-8 format. The general contract of `readUTF` is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a `String`.
First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the `readUnsignedShort` method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.
If the first byte of a group matches the bit pattern `0xxxxxxx` (where `x` means "may be `0` or `1`"), then the group consists of just that byte. The byte is zero-extended to form a character.
If the first byte of a group matches the bit pattern `110xxxxx`, then the group consists of that byte `a` and a second byte `b`. If there is no byte `b` (because byte `a` was the last of the bytes to be read), or if byte `b` does not match the bit pattern `10xxxxxx`, then a `UTFDataFormatException` is thrown. Otherwise, the group is converted to the character:
`(char)(((a & 0x1F) << 6) | (b & 0x3F))`
If the first byte of a group matches the bit pattern `1110xxxx`, then the group consists of that byte `a` and two more bytes `b` and `c`. If there is no byte `c` (because byte `a` was one of the last two of the bytes to be read), or either byte `b` or byte `c` does not match the bit pattern `10xxxxxx`, then a `UTFDataFormatException` is thrown. Otherwise, the group is converted to the character:
`(char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F))`
If the first byte of a group matches the pattern `1111xxxx` or the pattern `10xxxxxx`, then a `UTFDataFormatException` is thrown.
If end of file is encountered at any time during this entire process, then an `EOFException` is thrown.
After every group has been converted to a character by this process, the characters are gathered, in the same order in which their corresponding groups were read from the input stream, to form a `String`, which is returned.
The `writeUTF` method of interface `DataOutput` may be used to write data that is suitable for reading by this method.
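The decoding steps quoted above can be sketched by hand as follows. This is only an illustration of the documented contract, not the JDK's actual implementation; the class `ReadUtfSketch` and the methods `readModifiedUtf` and `encode` are hypothetical names. In real code you would simply call `DataInputStream.readUTF()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

final class ReadUtfSketch {
    // Hypothetical re-implementation of the readUTF contract described in the javadoc.
    static String readModifiedUtf(DataInputStream in) throws IOException {
        int utfLen = in.readUnsignedShort();   // the "UTF length"
        byte[] bytes = new byte[utfLen];
        in.readFully(bytes);                   // throws EOFException if the stream ends early
        StringBuilder sb = new StringBuilder(utfLen);
        int i = 0;
        while (i < utfLen) {
            int a = bytes[i] & 0xFF;
            if ((a & 0x80) == 0) {                       // 0xxxxxxx: 1-byte group
                sb.append((char) a);
                i += 1;
            } else if ((a & 0xE0) == 0xC0) {             // 110xxxxx: 2-byte group
                if (i + 1 >= utfLen) throw new UTFDataFormatException("truncated 2-byte group");
                int b = bytes[i + 1] & 0xFF;
                if ((b & 0xC0) != 0x80) throw new UTFDataFormatException("bad continuation byte");
                sb.append((char) (((a & 0x1F) << 6) | (b & 0x3F)));
                i += 2;
            } else if ((a & 0xF0) == 0xE0) {             // 1110xxxx: 3-byte group
                if (i + 2 >= utfLen) throw new UTFDataFormatException("truncated 3-byte group");
                int b = bytes[i + 1] & 0xFF;
                int c = bytes[i + 2] & 0xFF;
                if ((b & 0xC0) != 0x80 || (c & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("bad continuation byte");
                }
                sb.append((char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                i += 3;
            } else {                                     // 1111xxxx or 10xxxxxx
                throw new UTFDataFormatException("malformed first byte at offset " + i);
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: produce writeUTF output (length prefix included) for a string.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Mixes 1-byte, 2-byte, and 3-byte groups plus a surrogate pair.
        String original = "A\u0000\u00e9\u20ac" + new String(Character.toChars(0x1F600));
        byte[] encoded = encode(original);
        String decoded = readModifiedUtf(new DataInputStream(new ByteArrayInputStream(encoded)));
        System.out.println(decoded.equals(original)); // true
    }
}
```

Note that the surrogate pair needs no special handling here: each surrogate arrives as its own 3-byte group and is appended as an ordinary `char`, which is exactly what the third difference listed earlier implies.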