
Fixing incorrectly ISO-8859-1 decoded UTF-8 string in Java

I have to deal with a library which is not under my control. It delivers a string that it decoded from a byte stream using ISO-8859-1. However, the byte stream is actually UTF-8, so the resulting string is obviously wrong whenever it contains non-ASCII characters.

So to fix this, I convert the string back to bytes and decode those bytes again as UTF-8, like this:

import java.nio.charset.StandardCharsets;

// Undo the wrong decoding: ISO-8859-1 maps each char back to one byte,
// recovering the original byte stream, which is then decoded as UTF-8.
byte[] raw = inputText.getBytes(StandardCharsets.ISO_8859_1);
String correctedText = new String(raw, StandardCharsets.UTF_8);

I tested it with many examples and it seems to work. Is this always correct, though, or are there cases where it would fail? In other words: are there cases where decoding and re-encoding an arbitrary byte array with ISO-8859-1 does not yield the original byte array?

nharrer asked Nov 16 '25
1 Answer

Since ISO-8859-1 is a single-byte encoding in which every byte value 0x00 to 0xFF maps to exactly one character, this always works. The UTF-8 bytes are decoded to incorrect characters, but luckily no information is lost.

Converting those characters back to bytes with ISO-8859-1 gives you the original byte array, containing text encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
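A minimal, self-contained sketch of the round-trip described above (the string `"héllo"` is just an illustrative input, not from the question):

```java
import java.nio.charset.StandardCharsets;

public class Iso88591RoundTrip {
    public static void main(String[] args) {
        // "héllo" encoded as UTF-8: 'é' becomes the two bytes 0xC3 0xA9.
        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // Mis-decode as ISO-8859-1: 0xC3 0xA9 become the two chars 'Ã' '©'.
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);   // prints "hÃ©llo"

        // Round-trip: ISO-8859-1 maps each char back to its original byte,
        // so decoding those bytes as UTF-8 recovers the intended text.
        byte[] recovered = garbled.getBytes(StandardCharsets.ISO_8859_1);
        String fixed = new String(recovered, StandardCharsets.UTF_8);
        System.out.println(fixed);     // prints "héllo"
    }
}
```

Note the garbled intermediate string ("hÃ©llo") is exactly the classic mojibake you see when UTF-8 data is mislabeled as Latin-1.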

The opposite is not (always¹) true, because UTF-8 is a multi-byte encoding. Decoding arbitrary bytes as UTF-8 may encounter invalid byte sequences, which are replaced with the replacement character U+FFFD (�). At that point information is lost and the original bytes can no longer be recovered.

¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
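To illustrate the lossy direction, here is a sketch with a deliberately invalid UTF-8 input (the byte 0xC3 starts a two-byte sequence but has no continuation byte):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTripFails {
    public static void main(String[] args) {
        // 0xC3 alone is a truncated two-byte sequence: invalid UTF-8.
        byte[] original = { (byte) 0xC3 };

        // Decoding replaces the invalid byte with U+FFFD.
        String decoded = new String(original, StandardCharsets.UTF_8);

        // Re-encoding yields the 3-byte UTF-8 form of U+FFFD (0xEF 0xBF 0xBD),
        // not the original 0xC3 -- the round trip has lost information.
        byte[] roundTripped = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, roundTripped)); // prints "false"
    }
}
```

With ISO-8859-1 no byte value is ever invalid, which is exactly why the round trip in the question is safe in that direction only.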

Kayaman answered Nov 19 '25

