
Fixing incorrectly ISO-8859-1 decoded UTF-8 string in Java

I have to deal with a library which is not under my control. It delivers a string that it decoded from a byte stream using ISO-8859-1. However, the byte stream is actually UTF-8, so the resulting string is obviously wrong whenever it contains non-ASCII characters.

So to fix this, I convert the string back to bytes and decode those bytes again as UTF-8, like this:

import java.nio.charset.StandardCharsets;

// Undo the wrong decoding: ISO-8859-1 maps each char back to one byte,
// recovering the original byte stream, which is then decoded as UTF-8.
byte[] raw = inputText.getBytes(StandardCharsets.ISO_8859_1);
String correctedText = new String(raw, StandardCharsets.UTF_8);

I tested it with many examples and it seems to work. Is this always correct, though, or are there cases where it would fail? In other words: are there cases where decoding and re-encoding an arbitrary byte array with ISO-8859-1 does not yield the original byte array?

nharrer asked Nov 16 '25
1 Answer

Since ISO-8859-1 is a single-byte encoding in which every byte value 0x00 to 0xFF maps to exactly one character, this always works. The UTF-8 bytes are decoded to incorrect characters, but luckily no information is lost.

Converting those characters back to bytes with ISO-8859-1 gives you the original byte array, containing text encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
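A minimal, self-contained sketch of the round-trip described above (the string `"héllo"` is just an illustrative input, not from the question):

```java
import java.nio.charset.StandardCharsets;

public class Iso88591RoundTrip {
    public static void main(String[] args) {
        // "héllo" encoded as UTF-8: 'é' becomes the two bytes 0xC3 0xA9.
        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // Mis-decode as ISO-8859-1: 0xC3 0xA9 become the two chars 'Ã' '©'.
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);   // prints "hÃ©llo"

        // Round-trip: ISO-8859-1 maps each char back to its original byte,
        // so decoding those bytes as UTF-8 recovers the intended text.
        byte[] recovered = garbled.getBytes(StandardCharsets.ISO_8859_1);
        String fixed = new String(recovered, StandardCharsets.UTF_8);
        System.out.println(fixed);     // prints "héllo"
    }
}
```

Note the garbled intermediate string ("hÃ©llo") is exactly the classic mojibake you see when UTF-8 data is mislabeled as Latin-1.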

The opposite is not (always¹) true, because UTF-8 is a multi-byte encoding. Decoding arbitrary bytes as UTF-8 may encounter invalid byte sequences, which are replaced with the replacement character U+FFFD (�). At that point information is lost and the original bytes can no longer be recovered.

¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
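To illustrate the lossy direction, here is a sketch with a deliberately invalid UTF-8 input (the byte 0xC3 starts a two-byte sequence but has no continuation byte):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTripFails {
    public static void main(String[] args) {
        // 0xC3 alone is a truncated two-byte sequence: invalid UTF-8.
        byte[] original = { (byte) 0xC3 };

        // Decoding replaces the invalid byte with U+FFFD.
        String decoded = new String(original, StandardCharsets.UTF_8);

        // Re-encoding yields the 3-byte UTF-8 form of U+FFFD (0xEF 0xBF 0xBD),
        // not the original 0xC3 -- the round trip has lost information.
        byte[] roundTripped = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, roundTripped)); // prints "false"
    }
}
```

With ISO-8859-1 no byte value is ever invalid, which is exactly why the round trip in the question is safe in that direction only.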

Kayaman answered Nov 19 '25

