I've observed the following:
>>> print '£' + '1'
£1
>>> print '£' + u'1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> print u'£' + u'1'
£1
>>> print u'£' + '1'
£1
Why does '£' + '1' work but '£' + u'1' doesn't work?
I looked at the types:
>>> type('£' + '1')
<type 'str'>
>>> type('£' + u'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> type(u'£' + u'1')
<type 'unicode'>
This also confuses me. If '£' + '1' is a str and not a unicode, why does it print properly on my terminal? Shouldn't it print something like '\xc2\xa31'?
To add to the mix, I've also observed the following:
>>> u'£' + '1'
u'\xa31'
>>> type('1')
<type 'str'>
>>> type(u'£')
<type 'unicode'>
>>> print u'£' + '1'
£1
Why does u'£' + '1' not print out the £ symbol properly, whereas print u'£' + '1' does? Is it because repr is used in the former, whereas str is used in the latter?
Also, how come concatenation of a unicode and a str works in this case, but not in the '£' + u'1' case?
You are mixing object types.
'£' is a bytestring, containing encoded data. That those bytes happen to represent a pound sign in your terminal or console is neither here nor there; they could just as well have been a pixel in an image. Your terminal or console is configured to produce and accept UTF-8 data, so the actual content of that bytestring is the two bytes C2 and A3, when expressed in hexadecimal.
u'1' on the other hand is a Unicode string. It is unambiguously text data. If you want to concatenate other data to it, it too should be Unicode. If you concatenate a str to it anyway, Python 2 automatically decodes the str bytes to Unicode using the default ASCII codec. That is why u'£' + '1' works: the single byte in '1' is valid ASCII.
However, the '£' bytestring is not decodable as ASCII. It can be decoded as UTF-8; decode the bytes explicitly, since we know the correct codec here:
print '£'.decode('utf8') + u'1'
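The implicit step can also be spelled out by hand. The following is a minimal sketch, not part of the session above: it uses the byte escapes b'\xc2\xa3' in place of a literal £ so the result does not depend on the terminal or source-file encoding, and the variable names are only illustrative.

```python
# The two bytes a UTF-8 terminal sends for the pound sign.
pound_bytes = b'\xc2\xa3'

# This is the decode Python 2 attempts implicitly for '£' + u'1'.
try:
    pound_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    # 0xC2 is outside the 0-127 ASCII range, hence the traceback.
    ascii_ok = False

# Decoding with the codec that actually produced the bytes succeeds.
result = pound_bytes.decode('utf-8') + u'1'
```

Here ascii_ok ends up False, and result is the two-character Unicode string u'\xa31'.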
When writing bytes to the terminal or console, it is your terminal or console that interprets the bytes and makes sense of them. If you write a unicode object to the terminal, the sys.stdout object takes care of encoding, converting the text to bytes your terminal or console will understand.
The same applies to taking input; the sys.stdin stream produces bytes, which Python can decode transparently when you use the u'£' syntax to create a Unicode object. You type the character on your keyboard, it is translated to UTF-8 bytes by the terminal or console, and sent to Python to interpret.
That writing '\xc2\xa3' with print works, then, is a happy coincidence. You could take the unicode object, encode it with a different codec, and end up with garbage output:
>>> print u'£1'.encode('latin-1')
?1
My Mac terminal converted the data written for the £ sign to a ?, because the A3 byte (the Latin-1 codepoint for the pound sign) doesn't map to anything when interpreted as UTF-8.
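To see why the terminal gives up on that byte, you can compare the two encodings directly. This is a small sketch under the assumption that the terminal decodes as UTF-8; the errors='replace' handling only approximates what a terminal does when it shows a placeholder.

```python
pound = u'\xa3'  # the pound sign as a Unicode codepoint

utf8_bytes = pound.encode('utf-8')      # two bytes: C2 A3
latin1_bytes = pound.encode('latin-1')  # one byte: A3

# A lone A3 byte is a UTF-8 continuation byte without a lead byte,
# so it is not valid UTF-8 on its own.
try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# With errors='replace' the bad byte becomes U+FFFD, the replacement character.
replaced = latin1_bytes.decode('utf-8', 'replace')
```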
Python determines the terminal or console codec from the locale.getpreferredencoding() function. You can observe which codec your terminal or console reported via the sys.stdout.encoding and sys.stdin.encoding attributes:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
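If you want to do the same encoding step by hand, something like the following should work. It is only a sketch: falling back to locale.getpreferredencoding() when sys.stdout.encoding is missing (for example when output is piped) is my assumption, not something the interpreter guarantees to do in the same way.

```python
import codecs
import locale
import sys

# sys.stdout.encoding may be None when output is redirected; the fallback
# to the locale's preferred encoding is an assumption of this sketch.
encoding = getattr(sys.stdout, 'encoding', None) or locale.getpreferredencoding()

# Normalise the name, so 'UTF-8' and 'utf8' both report as 'utf-8'.
codec_name = codecs.lookup(encoding).name

# Encode text the way print would, replacing characters the codec lacks.
encoded = u'\xa31'.encode(encoding, 'replace')
```

On a UTF-8 terminal, encoded holds the bytes C2 A3 31; with an ASCII-only locale the pound sign is replaced by a question mark instead.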
Last but not least, you should not confuse printing with the representations echoed by the interpreter in interactive mode. The interpreter shows the outcome of expressions using the repr() function, a debugging tool that tries to produce Python literal notation wherever possible, using only ASCII characters. For Unicode values, that means any non-printable or non-ASCII character is shown as an escape sequence. This makes the value suitable for copying and pasting without requiring more than an ASCII-capable medium.
The repr() result of a str uses \n for newlines, for example, and \xhh hex escapes for bytes outside the printable range that have no dedicated escape sequence. In addition, for unicode objects, codepoints outside the Latin-1 range are represented with \uhhhh and \Uhhhhhhhh escape sequences, depending on whether or not they are part of the basic multilingual plane:
>>> u'''\
... A multiline string to show newlines
... can contain £ latin characters
... or emoji 💩!
... '''
u'A multiline string to show newlines\ncan contain \xa3 latin characters\nor emoji \U0001f4a9!\n'
>>> print _
A multiline string to show newlines
can contain £ latin characters
or emoji 💩!
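The three escape forms can be seen side by side with the unicode_escape codec, which produces the same kinds of escapes as the repr() of a unicode object (without the quotes and the u prefix). This is an illustrative sketch; the sample string is made up for the purpose.

```python
# One character from each range: a newline, a Latin-1 character (pound sign),
# a BMP character beyond Latin-1 (euro sign), and a non-BMP emoji.
text = u'\n \xa3 \u20ac \U0001f4a9'

# unicode_escape turns the text into ASCII-only bytes using Python escapes.
escaped = text.encode('unicode_escape')

# The escapes are lossless: decoding them restores the original text.
restored = escaped.decode('unicode_escape')
```

escaped contains \n, \xa3, \u20ac and \U0001f4a9 as literal ASCII text, and restored equals the original string.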