I'm trying to get a string from a website. I use the requests module to send the GET request.
import requests

text = requests.get("http://example.com")  # send a GET request to the website
print text.text  # print the decoded response body
However, for some reason, the text appears as gibberish instead of Hebrew:
<div>
<p>×©×¨×ª</p>
</div>
However, when I sniff the traffic with Fiddler or view the website in my browser, I see it in Hebrew:
<div>
<p>שרת</p>
</div>
By the way, the HTML contains a meta tag that declares the encoding as utf-8.
I tried encoding the text as utf-8, but it is still gibberish. I tried decoding it as utf-8, but that throws a UnicodeEncodeError exception.
I declared that I'm using utf-8 in the first line of the script.
Moreover, the problem also happens when I send the request with the built-in urllib module.
I read the Unicode HOWTO and many threads here (both about the UnicodeEncodeError exception and about why Hebrew turns into gibberish in Python), but I still couldn't fix it.
I'm using Python 2.7.9 on a Windows machine. I'm running my script in IDLE.
Thanks in advance.
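The server isn't declaring the encoding correctly, so requests most likely decodes the UTF-8 bytes as Latin-1. Reversing that wrong decode recovers the Hebrew: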
>>> print u'×©×¨×ª'.encode('latin-1').decode('utf-8')
שרת
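The round trip behind that one-liner can be reproduced offline. This is only a minimal sketch, assuming the body really is UTF-8 and was decoded as Latin-1 on the client; the variable names are illustrative, and Unicode escapes are used so the snippet does not depend on the source file encoding. It should run on both Python 2.7 and Python 3.

```python
# The Hebrew word from the question, written with escapes (U+05E9 U+05E8 U+05EA).
original = u"\u05e9\u05e8\u05ea"

raw = original.encode("utf-8")     # the bytes the server actually sends
garbled = raw.decode("latin-1")    # what a wrong Latin-1 decode produces

# Undo the wrong decode: back to the same bytes, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled == u"\xd7\xa9\xd7\xa8\xd7\xaa")  # True: the mojibake characters
print(repaired == original)                    # True: the Hebrew is recovered
```

Latin-1 maps every byte to exactly one character, which is why the wrong decode is lossless and can be reversed this way.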
Set text.encoding before accessing text.text.
import requests

text = requests.get("http://example.com")  # send a GET request to the website
text.encoding = 'utf-8'  # override the encoding requests guessed from the headers
print text.text  # the body is now decoded as UTF-8
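Since the question mentions that the page declares its encoding in a meta tag, another option is to read that declaration from the raw bytes instead of hard-coding 'utf-8'. The following is a rough sketch, not part of requests: `decode_html` and `_META_CHARSET` are hypothetical names, the regular expression only covers the common forms of the tag, and UTF-8 is assumed as the fallback.

```python
import re

# Looks for charset=... inside a <meta> tag. It runs on raw bytes, so the
# page does not have to be decoded before the declaration can be read.
_META_CHARSET = re.compile(
    br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_html(raw, fallback="utf-8"):
    """Decode HTML bytes with the charset named in a <meta> tag, if present."""
    match = _META_CHARSET.search(raw[:2048])  # the tag normally sits near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not match the declaration.
        return raw.decode(fallback, "replace")
```

With the response from the question, this would be called as decode_html(text.content), because content holds the undecoded bytes while text.text has already been decoded with text.encoding.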