 

errors=surrogateescape vs errors=replace

I am trying to open a file like that:

with open("myfile.txt", encoding="utf-8") as f:

but myfile.txt comes from my application's users, and 90% of the time this file is not UTF-8, which causes the application to exit because it fails to read it properly. The error looks like: 'utf-8' codec can't decode byte 0x9c

I've Googled about it and found some Stack Overflow answers that say to open my file like this:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:

but other answers say to use:

with open("myfile.txt", encoding="utf-8", errors="replace") as f:

So what is the difference between errors="replace" and errors="surrogateescape", and which one will fix the non-UTF-8 bytes in the file?

gabugu asked Oct 29 '25 16:10

1 Answer

The doc says:

'replace': Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding. Implemented in replace_errors().
...
'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)

That means that with replace, every offending byte is replaced with the same U+FFFD REPLACEMENT CHARACTER, while with surrogateescape each byte is replaced with a distinct value. For example, a '\xe9' would be replaced with a '\udce9' and a '\xe8' with a '\udce8'.
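You can see the difference directly on a short byte string (a minimal sketch, using a lone 0xe9 byte, which is invalid on its own in UTF-8):

```python
# A lone 0xe9 byte is not valid UTF-8 on its own.
data = b'caf\xe9'

# replace: the offending byte becomes U+FFFD and is lost.
print(data.decode('utf-8', errors='replace'))          # 'caf\ufffd'

# surrogateescape: the offending byte becomes the surrogate U+DCE9.
print(data.decode('utf-8', errors='surrogateescape'))  # 'caf\udce9'
```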

So with replace, you get valid Unicode characters but lose the original content of the file, while with surrogateescape, you can recover the original bytes (and can even rebuild them exactly with .encode(errors='surrogateescape')), but your Unicode string is not strictly valid because it contains raw surrogate codes.

Long story short: if the original offending bytes do not matter and you just want to get rid of the error, replace is a good choice; if you need to keep them for later processing, surrogateescape is the way to go.
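The round-trip property is easy to check on a small example (a sketch with two invalid bytes, 0xef and 0xe9):

```python
data = b'na\xefve \xe9'  # 0xef and 0xe9 are not valid UTF-8 here

# surrogateescape round-trips: encoding the decoded string with the same
# error handler gives back the exact original bytes.
text = data.decode('utf-8', errors='surrogateescape')   # 'na\udcefve \udce9'
assert text.encode('utf-8', errors='surrogateescape') == data

# replace does not: the offending bytes are collapsed to U+FFFD and gone.
lossy = data.decode('utf-8', errors='replace')          # 'na\ufffdve \ufffd'
assert lossy.encode('utf-8') != data
```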


surrogateescape has a very nice feature when you have files containing mainly ASCII characters with a few (accented) non-ASCII ones, and users who occasionally modify the file with a non-UTF-8 editor (or fail to declare the UTF-8 encoding). In that case, you end up with a file containing mostly UTF-8 data and some bytes in a different encoding, often CP1252 for Windows users of non-English Western European languages (like French, Portuguese, or Spanish). In that case it is possible to build a translation table that maps surrogate chars to their equivalents in the CP1252 charset:

# first map all surrogates in the range 0xdc80-0xdcff to codes 0x80-0xff
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
# then decode all bytes in the range 0x80-0xff as cp1252, and map the undecoded ones
#  to latin1 (using previous transtable)
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
# finally use above string to build a transtable mapping surrogates in the range 0xdc80-0xdcff
#  to their cp1252 equivalent, or latin1 if byte has no value in cp1252 charset
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)

You can then decode a file containing a mojibake mix of UTF-8 and CP1252:

with open("myfile.txt", encoding="utf-8", errors="surrogateescape") as f:
    for line in f:                     # ok utf8 has been decoded here
        line = line.translate(tab)     # and cp1252 bytes are recovered here

I have successfully used that method several times to recover CSV files that were produced as UTF-8 and had been edited with Excel on Windows machines.
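Here is a self-contained sketch of the whole recovery: it rebuilds the translation table as above, then repairs a line that mixes valid UTF-8 with a stray CP1252 byte (the sample words are made up for illustration):

```python
# Rebuild the surrogate -> CP1252 translation table.
tab0 = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)),
                     ''.join(chr(i) for i in range(0x80, 0x100)))
t = bytes(range(0x80, 0x100)).decode('cp1252', errors='surrogateescape').translate(tab0)
tab = str.maketrans(''.join(chr(i) for i in range(0xdc80, 0xdd00)), t)

# 'wörld' saved as CP1252 (single byte 0xf6) inside an otherwise UTF-8 line:
raw = 'héllo, '.encode('utf-8') + 'wörld'.encode('cp1252')

line = raw.decode('utf-8', errors='surrogateescape')  # 'héllo, w\udcf6rld'
print(line.translate(tab))                            # 'héllo, wörld'
```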

The same method could be used for other charsets derived from ASCII.

Serge Ballesta answered Nov 01 '25 05:11


