I have a crawler that parses the HTML of a given site and prints parts of the source code. Here is my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
import re

class Crawler:
    headers = {'User-Agent' : 'Mozilla/5.0'}
    keyword = 'arroz'

    def extra(self):
        url = "http://buscando.extra.com.br/search?w=" + self.keyword
        r = requests.head(url, allow_redirects=True)
        print(r.url)
        html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.encode('utf-8')

    def __init__(self):
        extra = self.extra()
        print(extra)

Crawler()
My code works fine, but it prints the source with an annoying b' before the value. I already tried to use decode('utf-8') but it didn't work. Any ideas?
UPDATE
If I don't use encode('utf-8'), I get the following error:
Traceback (most recent call last):
  File "crawler.py", line 25, in <module>
    Crawler()
  File "crawler.py", line 23, in __init__
    print(extra)
  File "c:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined>
When I run your code as given except replacing return soup.encode('utf-8') with return soup, it works fine. My environment:
15.103.4.3
beautifulsoup4==4.3.2
This leads me to suspect that the problem lies with your environment, not your code. Your stack trace mentions cp850.py and your script is hitting a .com.br site, which makes me think that the default encoding of your shell can't represent every Unicode character on that page. Here's the Wikipedia page for cp850 - Code Page 850.
You can check the default encoding your terminal is using with:
>>> import sys
>>> sys.stdout.encoding
My terminal responds with:
'UTF-8'
I'm assuming yours won't and that this is the root of the issue you are running into.
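If it helps, here is a minimal sketch for checking whether a given encoding can represent a particular character. The helper name can_encode is my own invention for illustration, not something from your script or from any library; it just wraps str.encode in a try/except.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# What the interpreter thinks the terminal uses (may be None if redirected).
print(sys.stdout.encoding)

# '\u2030' is the per mille sign from the traceback.
print(can_encode("\u2030", "cp850"))
print(can_encode("\u2030", "utf-8"))
```

If the first check comes back False and your sys.stdout.encoding is cp850, that would be consistent with the traceback you posted.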
EDIT:
In fact, I can exactly replicate your error with:
>>> print("\u2030".encode("cp850"))
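So that's the issue: because of your computer's locale settings, print implicitly encodes the string to your terminal's default encoding (cp850 in your case), which has no mapping for '\u2030', and so it raises the UnicodeEncodeError. It also explains the b' prefix: soup.encode('utf-8') returns a bytes object, and printing bytes shows its repr rather than text.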
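One possible workaround on the Python side, as a sketch rather than a definitive fix: keep the value as str, and before printing, replace any character the terminal's encoding cannot represent. The function name printable is hypothetical, and falling back to UTF-8 when sys.stdout.encoding is None is my assumption. It relies only on str.encode and bytes.decode with errors="replace".

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    Hypothetical helper; defaults to the terminal's encoding, or UTF-8
    when no encoding is reported (an assumption made for this sketch).
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # Round-trip through the target encoding; unencodable characters
    # become '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


# The per mille sign has no cp850 mapping, so it is replaced.
print(printable("10\u2030", "cp850"))
```

In your script that would mean returning str(soup) instead of soup.encode('utf-8') and calling print(printable(extra)). You lose the characters cp850 cannot show, but you get text without the b' prefix and without the exception.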
Updating Windows to display unicode characters from the command prompt is a bit outside my wheelhouse so I can't offer any advice other than to direct you to a relevant question/answer.