
Why does Python recognise this UTF-8 character as two characters rather than one?

Some UTF-8 text I'm trying to process has this lovely 4-byte character: \xF0\x9F\x98\xA5

Per this website, it's "disappointed but relieved face": http://apps.timwhitlock.info/emoji/tables/unicode

It appears to me that Python is treating this as two separate characters.

Here's my test code:

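# Python 2: decode the four UTF-8 bytes into a unicode string
mystring = '\xF0\x9F\x98\xA5'.decode('utf-8')
print len(mystring)
print mystring
# Length of the string once re-encoded as UTF-8 bytes
print len(mystring.encode('utf-8'))
# Iterate over whatever units Python treats as characters
for c in mystring:
    print c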

When I print mystring, I get a lovely face. But when I print the length of mystring I get 2.

Incidentally, the reason I'm trying to deal with this is that I need to find and handle 4-byte characters in the string so I can push it to a pre-5.5 MySQL database (which only handles UTF-8 sequences of up to 3 bytes).

I would appreciate help on why Python appears to recognize this as two characters, and also on how to detect 4-byte characters in a UTF-8 string.

Thanks.

user1379351 asked Oct 29 '25 00:10



1 Answer

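You're using a build of Python that doesn't properly count characters above U+FFFF. Python 2 "narrow" builds (and Python 3 before 3.3) store unicode strings as UTF-16 code units, so a character above U+FFFF is held as a surrogate pair and len() reports 2. Some other languages (Java, JavaScript) behave the same way; you can consider that a bug. Python 3.3 and later correctly treat this as one character.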
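To make the difference concrete, here is a minimal sketch that assumes Python 3.3 or later (not the Python 2 interpreter from the question). The variable names are only illustrative. It counts the same character three ways: as code points, as UTF-8 bytes, and as UTF-16 code units, the last being what a narrow build's len() effectively reports.

```python
# Sketch, assuming Python 3.3+: one code point, four UTF-8 bytes,
# two UTF-16 code units (a surrogate pair).
raw = b"\xF0\x9F\x98\xA5"      # the bytes from the question
text = raw.decode("utf-8")

code_points = len(text)                            # len() on Python 3.3+
utf8_bytes = len(text.encode("utf-8"))             # bytes when stored as UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2   # what a narrow build counts

print(hex(ord(text)), code_points, utf8_bytes, utf16_units)
# prints: 0x1f625 1 4 2
```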

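Recognising 4-byte characters is easy: the first byte of the four is always of the form 11110xxx, so any value in range(0xf0, 0xf8). (Valid UTF-8 only uses 0xF0 to 0xF4 as lead bytes, because code points stop at U+10FFFF.) These sequences represent all code points above U+FFFF.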
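The lead-byte check can be sketched as follows, again assuming Python 3 and valid UTF-8 input. The helper names four_byte_offsets and replace_astral are hypothetical, and swapping in U+FFFD is just one possible way to make text fit a 3-byte-only column, not something the question prescribes.

```python
def four_byte_offsets(data: bytes) -> list:
    """Return byte offsets where a 4-byte UTF-8 sequence starts.

    Assumes `data` is valid UTF-8; only lead bytes are inspected.
    """
    # Masking with 0xF8 keeps the top five bits; 0xF0 is the 11110xxx pattern.
    return [i for i, b in enumerate(data) if (b & 0xF8) == 0xF0]


def replace_astral(text: str, placeholder: str = "\uFFFD") -> str:
    """Replace every code point above U+FFFF with `placeholder`."""
    return "".join(placeholder if ord(ch) > 0xFFFF else ch for ch in text)


sample = "ok \U0001F625 done"
encoded = sample.encode("utf-8")
print(four_byte_offsets(encoded))   # prints: [3]

cleaned = replace_astral(sample)
# After replacement, no character needs more than 3 bytes in UTF-8.
print(max(len(ch.encode("utf-8")) for ch in cleaned))   # prints: 3
```

Working on the decoded string (the ord(ch) > 0xFFFF test) is equivalent to the byte-level check, and is often more convenient when the text has already been decoded.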

roeland answered Oct 31 '25 15:10
