The code snippet:
public static void main(String[] args) {
    String s = "qwertyuiop";
    System.out.println(Arrays.toString(Charset
            .forName("UTF-8")
            .encode(s)
            .array()));
}
Prints:
[113, 119, 101, 114, 116, 121, 117, 105, 111, 112, 0]
That seems to happen because, under the hood, the averageBytesPerChar variable appears to be 1.1 for UTF-8 inside the java.nio.charset.CharsetEncoder class. Hence it allocates 11 bytes instead of 10 and, provided the input string contains only good old single-byte chars, I get that odd null character at the end.
I wonder if this is documented anywhere?
This page:
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#encode(java.lang.String)
Doesn't give a clue about such behaviour.
P.S. Am I right that, in any case, the snippet above would be better replaced by:
s.getBytes(StandardCharsets.UTF_8)
which, as I see from its source, also trims the result in order to avoid those null chars?
If so, what is java.nio.charset.Charset's encode(String s) supposed to be for?
The culprit is not Charset.encode(), but Buffer.array(). If you print Charset.forName("UTF-8").encode(s), you will see the output
java.nio.HeapByteBuffer[pos=0 lim=10 cap=11]
The ByteBuffer has limit 10, the number of encoded bytes (equal to the string length here, since every character is a single byte in UTF-8), and capacity 11, the total allocated size of the buffer. If you change the encoding, the limit and capacity may differ even more, e.g.
System.out.println(Charset.forName("UTF-16").encode(s));
// java.nio.HeapByteBuffer[pos=0 lim=22 cap=41]
// (2 extra bytes because of the BOM, not null-termination)
When you call .array(), it will return the whole backing array, so even stuff beyond the limit will be included.
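To make the difference between limit, capacity, and the backing array concrete, here is a minimal sketch. The class name BufferBounds and the helper bounds() are made up for illustration; averageBytesPerChar() is the public accessor on CharsetEncoder. The exact capacity is an implementation detail of the JDK, so the code prints it rather than relying on a particular value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class BufferBounds {
    // Hypothetical helper: encodes s and returns {limit, capacity, array().length}.
    static int[] bounds(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        return new int[] { buf.limit(), buf.capacity(), buf.array().length };
    }

    public static void main(String[] args) {
        // The encoder's size estimate that drives the initial allocation.
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        System.out.println("averageBytesPerChar = " + enc.averageBytesPerChar());

        int[] b = bounds("qwertyuiop");
        // Only the first `limit` bytes are encoded data; the rest is unused space.
        System.out.println("limit=" + b[0] + " capacity=" + b[1]
                + " array().length=" + b[2]);
    }
}
```

Anything in the backing array past the limit is just spare room left over from the allocation, not part of the encoded string.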
The proper way to extract a Java byte array is through the .get() method:
ByteBuffer buf = Charset.forName("UTF-8").encode(s);
byte[] encoded = new byte[buf.remaining()]; // same as buf.limit() here, since pos=0
buf.get(encoded);
System.out.println(Arrays.toString(encoded));
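If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the class BufferBytes and the method toByteArray() are hypothetical names, not part of the JDK. It sizes the array with remaining() so it also works when the position is not 0, and reads through duplicate() so the caller's buffer position is left untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BufferBytes {
    // Hypothetical helper: copies only the bytes between position and limit.
    static byte[] toByteArray(ByteBuffer buf) {
        // duplicate() shares the content but has its own position and limit,
        // so reading from it does not consume the caller's buffer.
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode("qwertyuiop");
        System.out.println(Arrays.toString(toByteArray(buf)));
    }
}
```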
Why does this look like such a mess? Because "nio" stands for New I/O, an API built around buffers and channels rather than byte arrays. The Buffer type is designed so that it can wrap native memory (a direct buffer is essentially a C array), which makes interacting with native code, such as reading/writing files or sending/receiving network data, very efficient. These NIO APIs typically take a Buffer directly, without constructing any byte[] in between. If you are only working with Buffers, the middle two lines do not need to exist :).
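As a rough illustration of handing the buffer straight to an NIO API, the sketch below writes the encoded ByteBuffer to a WritableByteChannel. The class ChannelWrite and the method writeThroughChannel() are hypothetical, and the in-memory ByteArrayOutputStream merely stands in for a file or socket channel so that the example is self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ChannelWrite {
    // Hypothetical helper: encodes s and writes the buffer to a channel,
    // returning whatever the in-memory sink received.
    static byte[] writeThroughChannel(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel ch = Channels.newChannel(sink)) {
            // write() consumes bytes from position up to limit only,
            // so the spare capacity never reaches the channel.
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(writeThroughChannel("qwertyuiop")));
    }
}
```

No intermediate byte[] is created by the caller and .array() is never touched, so the unused tail of the backing array does not come into play.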
If the whole operation stays within Java, then yes, just call s.getBytes(StandardCharsets.UTF_8).