
Unicode emojis in Python from CSV files

I have some CSV data containing users' tweets.

In Excel, one tweet is displayed like this:

‰ÛÏIt felt like they were my friends and I was living the story with them‰Û  #retired #IAN1 

I imported this CSV file into Python, and there the same tweet appears like this (I am using PuTTY to connect to a server, and I copied this from PuTTY's screen):

▒▒▒It felt like they were my friends and I was living the story with them▒۝ #retired #IAN1 

I am wondering how to display these emoji characters properly. I am trying to split this tweet into words, but I am not sure how to separate out the emoji Unicode characters.

Morpheus asked Jan 25 '26 05:01


1 Answer

In fact, you have almost certainly lost data.

I don’t know how you obtained your CSV file of users' tweets (you could explain that). Traditionally, many CSV files, especially those exported from Excel on Windows, are encoded in "cp1252" (also called "windows-1252") and sometimes in "iso-8859-1". Nowadays, you can also find CSV files encoded in "utf-8".

If your tweets are encoded in "cp1252" or any other 8-bit single-byte character set, the emojis are lost (replaced by "?") or badly converted.
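Both failure modes can be reproduced in a few lines. This is only a sketch in Python 3 syntax, using Grinning Face (U+1F600) as an arbitrary sample character; it does not claim to reproduce how the file in the question was actually damaged.

```python
# A sample emoji: Grinning Face, U+1F600 (chosen only for illustration).
emoji = "\U0001F600"

# Case 1: forcing the emoji into a single-byte encoding loses it.
# With errors="replace", the codec substitutes "?" for what it cannot encode.
lossy = emoji.encode("cp1252", errors="replace")

# Case 2: valid UTF-8 bytes decoded with the wrong codec become mojibake.
utf8_bytes = emoji.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# Mojibake of this second kind is still reversible as long as no byte was dropped.
repaired = mojibake.encode("cp1252").decode("utf-8")

# ascii() keeps the output printable on consoles that cannot show Unicode.
print(ascii(lossy))
print(ascii(mojibake))
print(repaired == emoji)
```

In the first case the information is gone for good; in the second it can sometimes be recovered, provided every byte survived the round trip.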

Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice instead: its import dialog box lets you choose the encoding more easily.

Copying and pasting from PuTTY will also convert your characters, depending on your console encoding, which is even worse!
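Rather than judging the data by what Excel or a PuTTY window displays, it may be more reliable to read the file in Python with an explicit encoding and inspect the result there. A minimal sketch for Python 3, assuming the file really is UTF-8; the file name "tweets.csv" and the two-column layout are made up for the example, and the file is written first only so that the snippet is self-contained.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the real tweet export.
rows_in = [
    ["user", "tweet"],
    ["someone", "living the story \U0001F600 #retired #IAN1"],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets.csv")  # hypothetical file name

    # Write the sample as UTF-8 so there is something to read back.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows_in)

    # Read it back, naming the encoding instead of relying on the platform default.
    with open(path, encoding="utf-8", newline="") as f:
        rows_out = list(csv.reader(f))

# ascii() shows the code points without depending on the console encoding.
for row in rows_out:
    print(ascii(row))
```

If the escaped output shows the expected code points (here \U0001f600), the file is intact and only the display is at fault.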

If your CSV file uses the "utf-8" encoding (or "utf-16", "utf-32"), you have a better chance of preserving the emojis. But there is still a problem: most emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.

Characters like this are badly handled in Python 2; try this:

# coding: utf8
from __future__ import unicode_literals

emoji = u"😀"

print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))

You’ll get (if your console allows it):

emoji: 😀
repr: u'\U0001f600'
len: 2
  • The first line won’t print if your console doesn’t support Unicode.
  • The \U escape sequence is similar to \u, but expects 8 hex digits, not 4.
  • Yes, this character has a length of 2!
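The length of 2 comes from UTF-16: a so-called "narrow" Python 2 build stores Unicode strings as 16-bit code units, and a code point above U+FFFF needs two of them (a surrogate pair). A "wide" Python 2 build would report 1 instead. The sketch below, written in Python 3 syntax, makes the two code units visible; the helper name utf16_code_units is hypothetical.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">{}H".format(len(data) // 2), data))

emoji = "\U0001F600"
units = utf16_code_units(emoji)

print([hex(u) for u in units])  # ['0xd83d', '0xde00']
print(len(units))               # 2 code units for a single code point
```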

EDIT: With Python 3, you get:

emoji: 😀
repr: '😀'
len: 1
  • There is no escape sequence in the repr() output.
  • The length is 1!
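Since Python 3 treats each code point as one character, splitting a tweet into words and emojis becomes fairly direct. What follows is only a rough sketch: it assumes an "emoji" is any single code point above U+FFFF, which misses emojis in the Basic Multilingual Plane (such as U+263A) and splits emojis built from several code points. The pattern and the sample text are illustrative, not taken from the original file.

```python
import re

# Any single code point outside the Basic Multilingual Plane (above U+FFFF).
ASTRAL = "[\U00010000-\U0010FFFF]"

# A token is either a word (optionally prefixed with # or @) or one astral character.
TOKEN_RE = re.compile(r"[#@]?\w+|" + ASTRAL)

def tokenize(text):
    """Split text into words, hashtags, mentions and astral-plane characters."""
    return TOKEN_RE.findall(text)

tweet = "It felt like they were my friends \U0001F600 #retired #IAN1"
tokens = tokenize(tweet)

# ascii() keeps the output printable whatever the console encoding is.
print(ascii(tokens))
```

Punctuation is simply dropped by this pattern, which may or may not be what you want.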

What you can do is post a fragment of your CSV file as an attachment, so that someone can analyse it.

See also Unicode Literals in Python Source Code in the Python 2.7 documentation.

Laurent LAPORTE answered Jan 26 '26 19:01


