Some UTF-8 text I'm trying to process has this lovely 4-byte character: \xF0\x9F\x98\xA5
Per this website, it's "disappointed but relieved face": http://apps.timwhitlock.info/emoji/tables/unicode
It appears to me that Python is treating this as two separate characters.
Here's my test code:
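# Python 2: decode the four UTF-8 bytes into a unicode string
mystring = '\xF0\x9F\x98\xA5'.decode('utf-8')
print len(mystring)                  # number of "characters" Python reports
print mystring
print len(mystring.encode('utf-8'))  # number of bytes after re-encoding
for c in mystring:
    print c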
When I print mystring, I get a lovely face. But when I print the length of mystring I get 2.
Incidentally, the reason I'm trying to deal with this is that I need to handle 4-byte characters in the string so I can push it to a pre-5.5 MySQL database (which only handles 3-byte UTF-8).
I would appreciate help on why Python appears to recognize this as two characters, and also on how to detect 4-byte characters in a UTF-8 string.
Thanks.
You're using a version of Python that doesn't yet properly count characters above U+FFFF: a "narrow" build stores such a character internally as a UTF-16 surrogate pair, so len() reports 2. Some other languages (Java, JavaScript) behave the same way, and you can consider that a bug. Newer versions of Python (3.3 and later) correctly treat this as one character.
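To see where the 2 comes from, here is a small sketch that assumes Python 3.3 or later (not the Python 2 build from the question). It decodes the same four bytes, shows that they form a single code point, and then counts the 16-bit units UTF-16 needs for it, which is what a narrow build effectively reports as the length.

```python
import struct

# The four bytes from the question, decoded to text (Python 3).
text = b'\xF0\x9F\x98\xA5'.decode('utf-8')

# Python 3.3+ sees a single code point above U+FFFF.
code_point = ord(text)
print(len(text), hex(code_point))  # 1 0x1f625

# UTF-16 needs two 16-bit code units (a surrogate pair) for it.
units = struct.unpack('<2H', text.encode('utf-16-le'))
print([hex(u) for u in units])  # ['0xd83d', '0xde25']
```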
Recognising 4-byte characters is easy: the first byte of the four is always of the form 11110xxx, so its value is in range(0xf0, 0xf8). These sequences represent all code points above U+FFFF.
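As a rough sketch of that lead-byte check, the snippet below scans raw bytes for values in range(0xf0, 0xf8). It assumes the input is already valid UTF-8. The function names four_byte_offsets and replace_above_bmp are made up for illustration, and the second helper, which swaps such characters for U+FFFD before they reach a 3-byte-only database, assumes Python 3.3 or later, where each character above U+FFFF is a single code point.

```python
def four_byte_offsets(data):
    """Return the byte offsets where a 4-byte UTF-8 sequence starts."""
    # bytearray yields integers on both Python 2 and Python 3.
    return [i for i, byte in enumerate(bytearray(data)) if 0xF0 <= byte < 0xF8]


def replace_above_bmp(text, replacement=u'\ufffd'):
    """Replace every code point above U+FFFF (assumes Python 3.3+)."""
    return u''.join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


raw = b'ok \xF0\x9F\x98\xA5 done'
print(four_byte_offsets(raw))  # [3]

cleaned = replace_above_bmp(raw.decode('utf-8'))
print(cleaned.encode('utf-8'))  # b'ok \xef\xbf\xbd done'
```

Whether to replace, drop, or reject such characters is a design choice; U+FFFD is used here only because it encodes to three bytes in UTF-8.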