I'm maintaining a back-end service in Java, and I have the following Java 8 method that validates the input to my service API:
private static boolean containsDisallowedChars(String toValidate) {
    return !StandardCharsets.US_ASCII.newEncoder().canEncode(toValidate);
}
I'm expanding it to support Hindi and other non-English characters, so I've changed it from ASCII to UTF-8, as follows:
private static boolean containsDisallowedChars(String toValidate) {
    return !StandardCharsets.UTF_8.newEncoder().canEncode(toValidate);
}
Now I'm trying to update the corresponding unit test to pass in a String toValidate that will cause this method to return true, that is, a String that cannot be encoded.
How can I make a Java String that contains contents that can't be encoded to UTF-8?
I tried this test setup:
// ref https://stackoverflow.com/questions/1301402/example-invalid-utf8-string
// test data byte values https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
// 3.5 Impossible bytes
// The following two bytes cannot appear in a correct UTF-8 string
// 3.5.1 fe = "�"
// 3.5.2 ff = "�"
// 3.5.3 fe fe ff ff = "����"
final byte[] bytes = {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff};
log.info("bytes={}", bytes);
final String s = new String(bytes);
log.info("s={}", s);
log.info("s.length={}", s.length());
log.info("s.bytes={}", s.getBytes());
StandardCharsets.UTF_8.newEncoder().canEncode(s) returns true, and the log output shows that the String constructor changes the byte array as follows:
bytes=[-2, -2, -1, -1]
s=����
s.length=4
s.bytes=[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
I tried several variations on this, using other invalid UTF-8 byte arrays described in https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, with similar results.
It seems as if the String class is robustly creating valid UTF-8 strings despite my efforts to supply invalid byte arrays.
I tried Base64, as suggested in "How can I generate non-UTF-8 string / char in Java for testing purposes?":
final byte[] bytes = {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff};
log.info("bytes={}", bytes);
final String s = new String(Base64.getEncoder().encode(bytes));
log.info("s={}", s);
log.info("s.length={}", s.length());
log.info("s.bytes={}", s.getBytes());
Base64.getEncoder().encode doesn't return a String; it returns a byte[]. Therefore I must still call new String(byte[]), which yields a valid UTF-8 byte array. StandardCharsets.UTF_8.newEncoder().canEncode still returns true, and I get this log output:
bytes=[-2, -2, -1, -1]
s=/v7//w==
s.length=8
s.bytes=[47, 118, 55, 47, 47, 119, 61, 61]
Is it possible to create a Java String object that contains a string that can't be encoded as UTF-8? If not, does it mean my containsDisallowedChars method is unnecessary since it can never return true? Or is there a different validation approach I should consider instead of StandardCharsets.UTF_8.newEncoder().canEncode?
In your question, you noted:
It seems as if the String class is robustly creating valid UTF-8 strings despite my efforts to supply invalid byte arrays.
If you want to test a byte array to see if it is valid for a specific encoding, then you can use CharsetDecoder (not CharsetEncoder).
The CharsetDecoder can:
transform a sequence of bytes in a specific charset into a sequence of sixteen-bit Unicode characters.
If you pass the decode() method a ByteBuffer, you can use it as follows:
private static boolean testBytes(byte[] bytes) {
    boolean isValid = true;
    try {
        // decode() throws if the bytes are malformed for UTF-8
        StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
    } catch (CharacterCodingException ex) {
        //Logger.getLogger(MyTester.class.getName()).log(Level.SEVERE, null, ex);
        isValid = false;
    }
    return isValid;
}
So, for example, the following prints false, because 0xFF is not a valid UTF-8 byte sequence. (HexFormat requires Java 17 or later; on Java 8 you can write new byte[] {(byte) 0xff} instead.)
byte[] b = HexFormat.of().parseHex("ff");
System.out.println(testBytes(b));
Passing your example {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff} to testBytes also returns false.
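For a Java 8 code base, the same decoder check can be sketched without HexFormat. This is only a minimal, self-contained illustration: the class name Utf8ByteCheck and the method name isValidUtf8 are made up for this example, and the Hindi sample text is an assumed test input, not something from your service.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this sketch.
class Utf8ByteCheck {

    // Returns true only if the bytes decode as UTF-8 without error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by throwing,
            // instead of substituting U+FFFD the way new String(bytes, UTF_8) does.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The "impossible bytes" from the question.
        byte[] impossible = {(byte) 0xfe, (byte) 0xfe, (byte) 0xff, (byte) 0xff};
        // Assumed sample input: a Hindi greeting written with Unicode escapes.
        byte[] hindi = "\u0928\u092e\u0938\u094d\u0924\u0947".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(impossible)); // false
        System.out.println(isValidUtf8(hindi));      // true
    }
}
```

Note that this check only helps where the service still has the raw bytes, for example before a request body is turned into a String.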
In your question, you asked:
Is it possible to create a Java String object that contains a string that can't be encoded as UTF-8?
By the time you have created a Java String, it's "too late" to check. As you have seen, any invalid byte sequences have already been replaced by the Unicode replacement character (U+FFFD), which is itself a valid character in a Java string. The String object itself "represents a string in the UTF-16 format", and both UTF-8 and UTF-16 cover all valid Unicode code points.
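As far as I know, there is one case worth adding for your unit test: a String built from chars rather than decoded from bytes can hold an unpaired UTF-16 surrogate, which is not a valid code point, and the UTF-8 encoder rejects it. The sketch below illustrates this. The class name LoneSurrogateCheck and the method name cannotEncodeAsUtf8 are made up for the example; the method body mirrors your containsDisallowedChars.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this sketch.
class LoneSurrogateCheck {

    // Same logic as containsDisallowedChars in the question.
    static boolean cannotEncodeAsUtf8(String s) {
        return !StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // A high surrogate (U+D800) with no low surrogate after it.
        String loneSurrogate = "abc\uD800";
        // A complete surrogate pair, encoding the code point U+1F600.
        String validPair = "\uD83D\uDE00";
        // Decoding an invalid byte yields the replacement character U+FFFD.
        String decoded = new String(new byte[] {(byte) 0xff}, StandardCharsets.UTF_8);

        System.out.println(cannotEncodeAsUtf8(loneSurrogate)); // true
        System.out.println(cannotEncodeAsUtf8(validPair));     // false
        System.out.println(cannotEncodeAsUtf8(decoded));       // false
    }
}
```

So the method is not entirely dead code, but it only catches unpaired surrogates. It cannot detect bytes that were already replaced during decoding.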