I have a problem printing Hebrew words. I am using the Counter class from the collections module to
count the words in my text (which is in Hebrew). The counter does count
the words, and recognizes the language because I am using # -*- coding: utf-8 -*-
The problem is that when I print my counter, I get weird symbols. (I am using Eclipse.) Here is the code and the output:
# -*- coding: utf-8 -*-
import string
from collections import Counter

class classifier:
    def __init__(self,filename):
        self.myFile = open(filename)
        self.cnt = Counter()

    def generateList(self):
        exclude = set(string.punctuation)
        for lines in self.myFile:
            for word in lines.split():
                if word not in exclude:
                    nWord = ""
                    for letter in word:
                        if letter in exclude:
                            letter = ""
                            nWord += letter
                        else:
                            nWord += letter
                    self.cnt[nWord]+=1
        print self.cnt
Output:
Counter({'\xd7\x97\xd7\x94': 465, '\xd7\x96\xd7\x95': 432, '\xd7\xa1\xd7\x92\xd7\x95\xd7\xa8': 421, '\xd7\x94\xd7\x92\xd7\x91': 413})
Any idea on how to print the words in the right way?
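The "weird symbols" are not corrupted data. They are how Python 2 displays the repr of a byte string (str) that holds UTF-8 encoded text. Printing a container such as a Counter shows the repr of each key, so every non-ASCII byte appears as a \x escape.
To see the Hebrew text, decode each word and print it on its own, for example:
>>> print '\xd7\x97\xd7\x94'.decode('UTF8')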
חה
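To print the whole counter readably, one option is to loop over its entries and decode each key before printing. The following is only a minimal sketch: the helper name format_counts is hypothetical, the sample data is copied from the output in the question, and it assumes the words are UTF-8 encoded bytes and that your console (for example the Eclipse console) is set to an encoding that can display Hebrew.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
from collections import Counter

# Sample data copied from the question's output (UTF-8 encoded byte strings).
cnt = Counter({b'\xd7\x97\xd7\x94': 465, b'\xd7\x96\xd7\x95': 432})

def format_counts(counter):
    """Hypothetical helper: one 'word: count' line per entry, most common first."""
    lines = []
    for word, count in counter.most_common():
        # Decode the raw bytes into text before formatting them for display.
        lines.append(u'{0}: {1}'.format(word.decode('utf-8'), count))
    return lines

for line in format_counts(cnt):
    print(line)
```

Printing the strings one by one avoids the container repr, which is what produces the \x escapes.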