python url unquote followed by unicode decode

Question

I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode it.
I used urllib.unquote_plus(str), but it gives the wrong result.

expected : çöasd+fjkls%asd
result : Ã§Ã¶asd fjkls%asd

The two-byte UTF-8 characters (%C3%A7 and %C3%B6) are decoded incorrectly.
I am using Python 2.7 on a Linux distribution. What is the best way to get the expected result?

John Machin · Accepted Answer

You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.

Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
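To see how the symptom in the question can arise, here is a minimal sketch (the variable names are illustrative, not from the question). It takes the two UTF-8 byte pairs that %C3%A7%C3%B6 stands for and interprets them once as UTF-8 and once as Latin-1. The Latin-1 reading turns each byte into its own character, which is the kind of rubbish shown in the question.

```python
# The four bytes that the escapes %C3%A7%C3%B6 stand for.
utf8_bytes = b'\xc3\xa7\xc3\xb6'

# Interpreted as UTF-8: two characters (c-cedilla, o-diaeresis).
right = utf8_bytes.decode('utf-8')

# Interpreted as Latin-1: four characters, one per byte (the mojibake).
wrong = utf8_bytes.decode('latin-1')

print(repr(right))
print(repr(wrong))
```

Comparing the two repr() results, rather than the printed glyphs, keeps the console encoding out of the picture.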

If as you say you start off with a unicode object:

>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'

this is accidental nonsense: percent-escapes stand for bytes, so the quoted text belongs in a str object. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd') which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:

>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'

then you should unquote it:

>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'

Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:

>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'

and inspect it to see what we've actually got:

>>> import unicodedata
>>> for c in s3[:6]:
...     print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN

Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; the approach is portable, and I just happen to be running it in a Command Prompt on Windows.

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd

Note: print implicitly encoded the unicode object using sys.stdout.encoding. Fortunately all the unicode characters in s3 are representable in that encoding (and in cp1252 and latin1).
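The encode, unquote, decode steps can be gathered into one small function. This is only a sketch: the name unquote_to_unicode is hypothetical, and the Python 3 fallback import is an addition that is not part of the walkthrough, included so the same snippet runs on either version.

```python
try:
    from urllib import unquote                            # Python 2: str in, str out
except ImportError:
    from urllib.parse import unquote_to_bytes as unquote  # Python 3: returns bytes


def unquote_to_unicode(quoted):
    """Hypothetical helper: percent-decode, then read the bytes as UTF-8."""
    # Percent-escapes describe bytes, so make sure we start from a byte string.
    if not isinstance(quoted, bytes):
        quoted = quoted.encode('ascii')
    raw = unquote(quoted)        # e.g. the byte string '\xc3\xa7\xc3\xb6asd+fjkls%asd'
    return raw.decode('utf-8')   # bytes -> unicode text


result = unquote_to_unicode(u'%C3%A7%C3%B6asd+fjkls%25asd')
print(repr(result))
```

The sketch uses unquote rather than unquote_plus because the expected result in the question keeps the '+' character; unquote_plus would turn it into a space.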
