When I fetch a webpage, I detect its encoding (with chardet here) and convert it to UTF-8, like this:
import urllib2
import chardet
from lxml import html

# url: the address of the page to fetch (defined elsewhere)
content = urllib2.urlopen(url).read()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    # re-encode anything that is not already UTF-8
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)
But when I use:
text = doc.text_content()
print type(text)
The output is <type 'lxml.etree._ElementUnicodeResult'>.
Why? I thought it would be a UTF-8 string.
lxml.etree._ElementUnicodeResult is a class that inherits from unicode:
$ pydoc lxml.etree._ElementUnicodeResult
lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
| Method resolution order:
| _ElementUnicodeResult
| __builtin__.unicode
| __builtin__.basestring
| __builtin__.object
In Python, it's fairly common to have classes that extend from base types to add some module-specific functionality. It should be safe to treat the object like a regular Unicode string.
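The same pattern is easy to reproduce. The sketch below is written for Python 3, where the text type is str rather than unicode. TaggedText and its source attribute are hypothetical names made up for illustration; they are not part of lxml and do not mirror how _ElementUnicodeResult is implemented.

```python
class TaggedText(str):
    """A str subclass carrying one extra attribute (hypothetical, not lxml's class)."""

    def __new__(cls, value, source=None):
        obj = super().__new__(cls, value)
        obj.source = source  # extra, module-specific data
        return obj


text = TaggedText("caf\u00e9", source="<p>")

# type() reports the subclass, yet the object is still a regular string.
print(type(text).__name__)    # TaggedText
print(isinstance(text, str))  # True

# Ordinary string operations work unchanged.
upper = text.upper()
encoded = text.encode("utf-8")
```

Anything that accepts a plain string should accept such an object too, which is why the unusual type name in the question is harmless.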
You might want to skip the re-encoding step, as lxml.html will automatically use the encoding specified in the source file, and as long as it ends up as valid unicode, there's perhaps no reason to be concerned with how it was initially encoded.
Unless your project is so small and informal that you can be sure you will never encounter 8-bit strings (i.e. it's always 7-bit ASCII, English with no special characters), it's wise to get your text into unicode as early as possible (like right after retrieval) and keep it that way until you need to serialize it for writing to a file or sending over a socket.
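As a rough illustration of "decode early, encode late", here is a minimal Python 3 sketch (str is the unicode type there, bytes the 8-bit string). The helpers decode_page and serialize are hypothetical names, and the Latin-1 sample bytes stand in for whatever a real download would return.

```python
def decode_page(raw, declared_encoding="utf-8"):
    """Turn raw bytes into text as soon as they arrive (hypothetical helper)."""
    return raw.decode(declared_encoding, "replace")


def serialize(text):
    """Encode text to UTF-8 bytes only at the output boundary (hypothetical helper)."""
    return text.encode("utf-8")


# Stand-in for bytes read from a socket: a Latin-1 encoded fragment.
raw = "na\u00efve caf\u00e9".encode("latin-1")

text = decode_page(raw, "latin-1")  # bytes -> text, once, at the edge
words = text.split()                # all processing happens on text
out = serialize(" ".join(w.title() for w in words))  # text -> bytes, once, at the end
```

Everything between the two boundary calls works on text only, so the original encoding of the page no longer matters there.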
The reason you're seeing <type 'lxml.etree._ElementUnicodeResult'> is that lxml.html.fromstring() automatically does the decode step for you. Note that this means your code above will not work for a page encoded in UTF-16, for example: the 8-bit string will have been re-encoded as UTF-8, but the HTML will still declare UTF-16:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-16" />
lxml will then try to decode the string using UTF-16 rules, which I would expect to raise an exception in short order.
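To see why a mismatch between the declared and the actual encoding is a problem, here is a small Python 3 sketch using only the built-in codecs. It shows plain byte-level decoding, not what lxml's parser does internally, and the explicit "utf-16-le" byte order is an assumption chosen to keep the result the same on every platform.

```python
page = "<p>caf\u00e9</p>"
utf8_bytes = page.encode("utf-8")

# Decoding with the encoding the bytes really use round-trips cleanly.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding the same bytes under UTF-16 rules pairs them up incorrectly,
# so the result is unrelated text rather than the original markup.
garbled = utf8_bytes.decode("utf-16-le")
print(garbled == page)  # False

# With an odd number of bytes, the UTF-16 decoder cannot even finish.
try:
    b"abc".decode("utf-16-le")
    failed = False
except UnicodeDecodeError:
    failed = True
```

Depending on the input, the outcome is either silent garbage or an outright decoding error, and neither is what you want from a parsed page.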
If you want the output serialized as a UTF-8 encoded 8-bit string, all you need is this:
>>> text = doc.text_content().encode('utf-8')
>>> print type(text)
<type 'str'>