Path file = Paths.get("New Text Document.txt");
try {
    System.out.println(Files.readString(file, StandardCharsets.UTF_8));
    System.out.println(Files.readString(file, StandardCharsets.UTF_16));
} catch (Exception e) {
    System.out.println("yep it's an exception");
}
might yield
some text
Exception in thread "main" java.lang.Error: java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.lang.String.decodeWithDecoder(String.java:1212)
at java.base/java.lang.String.newStringNoRepl1(String.java:786)
at java.base/java.lang.String.newStringNoRepl(String.java:738)
at java.base/java.lang.System$2.newStringNoRepl(System.java:2390)
at java.base/java.nio.file.Files.readString(Files.java:3369)
at test.Test2.main(Test2.java:13)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.lang.String.decodeWithDecoder(String.java:1205)
... 5 more
This error "shouldn't happen". Here's the java.lang.String method:
private static int decodeWithDecoder(CharsetDecoder cd, char[] dst, byte[] src, int offset, int length) {
    ByteBuffer bb = ByteBuffer.wrap(src, offset, length);
    CharBuffer cb = CharBuffer.wrap(dst, 0, dst.length);
    try {
        CoderResult cr = cd.decode(bb, cb, true);
        if (!cr.isUnderflow())
            cr.throwException();
        cr = cd.flush(cb);
        if (!cr.isUnderflow())
            cr.throwException();
    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
        throw new Error(x);
    }
    return cb.position();
}
EDIT: As @user16320675 noted, this happens when a UTF-8 file with an odd number of bytes (for plain ASCII content, an odd number of characters) is read as UTF-16. With an even number, neither the Error nor the MalformedInputException happens. Why the Error?
This is a bug introduced in JDK 17.
Prior to this version, this Error-throwing code was only used for the String constructor, which indeed can never encounter a CharacterCodingException because it configures the decoder to substitute illegal content.
E.g., when you use
String s = new String(new byte[] { 50 }, StandardCharsets.UTF_16);
System.out.println(s.chars()
    .mapToObj(c -> String.format(" U+%04x", c)).collect(Collectors.joining("", s, "")));
you’ll get
� U+fffd
In JDK 17, the code was refactored to remove code duplication. Now the same method, decodeWithDecoder, is used for both the String constructor and Files.readString. But Files.readString is supposed to report encoding errors instead of substituting the problematic content, so for that path the decoder is intentionally not configured to substitute malformed content.
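The difference between the two decoder configurations described above can be reproduced with the public CharsetDecoder API, independent of the JDK internals. The following is only a sketch: the class DecoderModes and its decode helper are made up for illustration and are not how java.lang.String sets up its decoder internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
final class DecoderModes {

    // Decodes the bytes as UTF-16, applying the given action to both
    // malformed input and unmappable characters.
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        CharsetDecoder cd = StandardCharsets.UTF_16.newDecoder()
            .onMalformedInput(action)
            .onUnmappableCharacter(action);
        return cd.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] oddLength = { 50 }; // a single byte cannot form a UTF-16 code unit

        // Substituting decoder: the stray byte becomes the replacement character.
        String replaced = decode(oddLength, CodingErrorAction.REPLACE);
        System.out.printf("REPLACE: U+%04x%n", (int) replaced.charAt(0));

        // Reporting decoder: the same input raises a CharacterCodingException.
        try {
            decode(oddLength, CodingErrorAction.REPORT);
        } catch (MalformedInputException e) {
            System.out.println("REPORT: " + e);
        }
    }
}
```

With REPLACE, a catch block for CharacterCodingException is effectively dead code, which is what the "shouldn't happen" comment relies on. With REPORT, the exception is the expected outcome for malformed input.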
When you run
Path p = Files.write(Files.createTempFile("charset", "test"), new byte[] { 50 });
try(Closeable c = () -> Files.delete(p)) {
    String s = Files.readString(p, StandardCharsets.UTF_16);
}
under JDK 16, you’ll correctly get
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.lang.StringCoding.newStringNoRepl1(StringCoding.java:1053)
at java.base/java.lang.StringCoding.newStringNoRepl(StringCoding.java:1003)
at java.base/java.lang.System$2.newStringNoRepl(System.java:2265)
at java.base/java.nio.file.Files.readString(Files.java:3353)
at first.test17.CharsetProblem.main(CharsetProblem.java:23)
The now-removed dedicated routine threw the MalformedInputException encapsulated in an IllegalArgumentException. The immediate caller looks like
/*
 * Throws CCE, instead of replacing, if unmappable.
 */
static byte[] getBytesNoRepl(String s, Charset cs) throws CharacterCodingException {
    try {
        return getBytesNoRepl1(s, cs);
    } catch (IllegalArgumentException e) {
        //getBytesNoRepl1 throws IAE with UnmappableCharacterException or CCE as the cause
        Throwable cause = e.getCause();
        if (cause instanceof UnmappableCharacterException) {
            throw (UnmappableCharacterException)cause;
        }
        throw (CharacterCodingException)cause;
    }
}
and there lies the problem. When the code was refactored to use the same routine for the String constructor and Files.readString, this caller was not adapted: it still expects an IllegalArgumentException where the common method now throws an Error. Alternatively, the common method should have been adapted to suit both cases, e.g. by taking a parameter that tells whether a CharacterCodingException should be possible or not.
It’s worth noting that the charset decoding code has a lot of optimizations and shortcuts for commonly used charsets. That’s why you rarely get to this specific method. UTF-16 seems to be one of the rare cases (if not the only one) where this method is used.
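As long as code has to run on an affected JDK, one possible workaround is to unwrap the Error in the calling code. The following is a minimal sketch, not an official fix: StrictRead and readString are hypothetical names, and it assumes the Error carries the CharacterCodingException as its direct cause, as in the stack trace shown in the question.

```java
import java.io.IOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the JDK.
final class StrictRead {

    // Behaves like Files.readString, but rethrows a CharacterCodingException
    // that arrives wrapped in a java.lang.Error.
    static String readString(Path path, Charset cs) throws IOException {
        try {
            return Files.readString(path, cs);
        } catch (Error e) {
            Throwable cause = e.getCause();
            if (cause instanceof CharacterCodingException) {
                // Restore the documented behavior: report the encoding problem.
                throw (CharacterCodingException) cause;
            }
            throw e; // any other Error is none of our business
        }
    }
}
```

Catching Error is normally something to avoid, so the catch block is kept as narrow as possible and rethrows everything that is not a wrapped CharacterCodingException. On JDK versions without the bug, the helper simply passes the exception from Files.readString through.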