
Why does Python json.dumps fail on mixed utf-8 & unicode strings?

Python 2.x's built-in json library can encode both unicode strings and UTF-8 encoded (non-ASCII) byte strings, but apparently not in the same call. Try:

import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False)

and see it raise a UnicodeDecodeError. Whereas both:

json.dumps([u'Ä'], ensure_ascii=False)

and

json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)

...work ok.

Why does JSON-encoding data that contains both unicode strings and UTF-8 encoded (non-ASCII) byte strings raise a UnicodeDecodeError? My Python site encoding is ASCII.

Asked by Petri, Dec 27 '25 18:12


1 Answer

It fails because json.dumps doesn't know what kind of output string to produce.

In my Python 2.7:

>>> json.dumps([u'Ä'], ensure_ascii=False)
u'["\xc4"]'

(a Unicode string)

and

>>> json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)
'["\xc3\x84"]'

(a UTF-8 encoded byte string)

So if you give it UTF-8 encoded byte strings, it produces the JSON as a UTF-8 encoded byte string, and if you give it unicode strings, it produces the JSON as a unicode string.

If you mix them, it can't do both.
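
A short sketch of where the error likely comes from: when Python 2 has to combine a byte string with a unicode string, it implicitly decodes the byte string with the default codec, which is ASCII on the asker's setup. The snippet below performs that decoding step explicitly (the names utf8_bytes and failed are only for illustration), so it behaves the same on Python 2 and 3:

```python
# -*- coding: utf-8 -*-
# The UTF-8 encoding of u'Ä' is the two bytes '\xc3\x84'.
utf8_bytes = u"\xc4".encode("utf-8")

try:
    # Python 2 does this decode implicitly, using the default (ASCII)
    # codec, whenever a byte string is joined with a unicode string.
    utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError:
    # Bytes >= 0x80 are not valid ASCII, hence the error.
    failed = True

print(failed)
```

This only illustrates the decoding failure itself; it does not reproduce json.dumps internals.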

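To fix this, you can pass an explicit encoding argument (even though it names the same encoding as the default). That seems to make the result always a unicode string, apparently because the byte strings are then decoded to unicode before the output is assembled: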

>>> import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False, encoding="UTF8")
u'["\xc4", "\xc4"]'
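
An alternative that doesn't rely on the encoding argument is to decode every byte string to unicode yourself before calling json.dumps. This is only a minimal sketch: to_text is a hypothetical helper (not part of the json module), and it assumes all byte strings in the data are UTF-8 encoded:

```python
# -*- coding: utf-8 -*-
import json


def to_text(obj, encoding="utf-8"):
    """Recursively decode byte strings inside lists, tuples and dicts.

    Hypothetical helper for illustration; assumes every byte string
    in the data uses the given encoding.
    """
    if isinstance(obj, bytes):  # 'bytes' is an alias of 'str' on Python 2
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {to_text(k, encoding): to_text(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_text(item, encoding) for item in obj]
    return obj  # numbers, None, unicode strings: left unchanged


mixed = [u"\xc4", u"\xc4".encode("utf-8")]
result = json.dumps(to_text(mixed), ensure_ascii=False)
print(repr(result))
```

Since everything handed to json.dumps is unicode by then, there is nothing left to mix, and the same code also runs on Python 3.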

Answered by RemcoGerlich, Dec 30 '25 08:12


