I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number') agent_telno = '' if agent_telno is None else agent_telno.contents[0] p.agent_info = str(agent_contact + ' ' + agent_telno).strip() Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):   File "foobar.py", line 792, in <module>     p.agent_info = str(agent_contact + ' ' + agent_telno).strip() UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128) I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
You need to use the . encode method of the unicode string to convert it to a series of bytes that represent that string before you can save it. That's basically what the str type represents in Python 2. x - just a series of bytes, not a series of characters like you might expect.
The Python "UnicodeEncodeError: 'ascii' codec can't encode character in position" occurs when we use the ascii codec to encode a string that contains non-ascii characters. To solve the error, specify the correct encoding, e.g. utf-8 . Here is an example of how the error occurs. Copied!
You need to read the Python Unicode HOWTO. This error is the very first example.
Basically, stop using str to convert from unicode to encoded text / bytes.
Instead, properly use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip() or work entirely in unicode.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With