
How to decode escaped Unicode characters?

I'm trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, but the actual output is Ã¤.

Toast asked Oct 23 '25 10:10



2 Answers

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
  .encode('latin-1')
  .decode('unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string that contains only ascii-characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (doesn't really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets the '\u00c3' escapes as unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this unicode code point has the right byte representation when again encoded with ISO-8859-1/'latin-1', so...
  • encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8
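The four steps above can be sketched one at a time, which makes the intermediate values visible. The variable names are only illustrative and are not part of the original answer.

```python
escaped = "\\u00c3\\u00a4"  # 12 ASCII characters: \u00c3\u00a4

# Step 1: ASCII-only text to bytes (any ASCII-compatible 8-bit codec would do).
raw = escaped.encode("latin-1")

# Step 2: interpret the \uXXXX escapes; this yields U+00C3 and U+00A4.
mojibake = raw.decode("unicode_escape")

# Step 3: map each code point below 256 back to the byte with the same value.
utf8_bytes = mojibake.encode("latin-1")

# Step 4: decode those bytes as the UTF-8 they were meant to be.
fixed = utf8_bytes.decode("utf-8")

print(repr(mojibake), utf8_bytes, repr(fixed))
# prints: 'Ã¤' b'\xc3\xa4' 'ä'
```

Step 3 raises UnicodeEncodeError if the text contains any code point above U+00FF, so this repair only applies to text that is broken in exactly this way.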

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
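As an illustration only, here is one hypothetical way such text can come into existence: UTF-8 bytes are wrongly decoded as Latin-1 and then serialized with ASCII-only JSON escaping. The question does not say how its string was produced, so the pipeline below is an assumption, not a diagnosis.

```python
import json

original = "ä"

# Assumed bug: UTF-8 bytes are read back with the wrong codec.
misdecoded = original.encode("utf-8").decode("latin-1")   # 'Ã¤'

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
broken = json.dumps(misdecoded)
print(broken)   # "\u00c3\u00a4"

# Without the wrong decode, the escape is the expected single code point,
# and json.loads restores the original text with no repair step.
correct = json.dumps(original)
print(correct)              # "\u00e4"
print(json.loads(correct))  # ä
```

If the producer looks anything like this, decoding with the right codec at the source removes the need for the latin-1/UTF-8 round trip entirely.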

Andrey Tyukin answered Oct 24 '25 23:10



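The codecs doc page states that the unicode_escape codec decodes from Latin-1 source code, while noting that Python source code actually uses UTF-8 by default.

That means the "unicode-escape" decoder treats every non-escaped byte as a Latin-1 character, even though the default encoding in Python is UTF-8, so a UTF-8 encoded 'é' comes out as 'Ã©'.
So you just need to encode the result back to latin1 and decode it as UTF-8: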

import codecs

mixed_string_to_be_unescaped = '\u002Fq:85\\u002FczM"},{\"name\":\"Santé\",\"parent_name\":\"Santé'

val = codecs.decode(mixed_string_to_be_unescaped, 'unicode-escape')
val = val.encode('latin1').decode('utf-8')
print(val)

/q:85/czM"},{"name":"Santé","parent_name":"Santé
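To see why the round trip through latin1 is needed, a minimal sketch can isolate the effect on a single non-ASCII character. The variable names are illustrative and do not come from the answer.

```python
import codecs

utf8_e = "é".encode("utf-8")                # b'\xc3\xa9'

# unicode_escape turns each non-escaped byte into the code point of the same
# value, i.e. it reads the bytes as Latin-1 rather than UTF-8.
as_latin1 = utf8_e.decode("unicode_escape")
print(repr(as_latin1))                      # 'Ã©'

# Encoding with latin1 recovers the original bytes; UTF-8 then decodes them.
repaired = as_latin1.encode("latin1").decode("utf-8")
print(repr(repaired))                       # 'é'

# Passing a str to codecs.decode, as in the snippet above, gives the same
# intermediate result.
from_str = codecs.decode("é", "unicode-escape")
print(from_str == as_latin1)                # True
```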

The other answer's solution works, but it was not clear to me why I should encode to latin-1 before the unicode_escape step (I discovered that the decoder does this automatically), nor why it was using unicode_escape on an unescaped string.

Daniele Rugginenti answered Oct 24 '25 23:10



