Which encoding does the Python lxml module use internally?

When I fetch a web page, I detect its encoding with chardet and convert the content to UTF-8, like this:

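import urllib2

import chardet
from lxml import html

# url is assumed to be defined earlier
content = urllib2.urlopen(url).read()  # raw bytes (a Python 2 str)
encoding = chardet.detect(content)['encoding']  # best guess; may be None
if encoding != 'utf-8':
    # re-encode everything that was not detected as UTF-8
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)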

But when I use:

text = doc.text_content()
print type(text)

The output is <type 'lxml.etree._ElementUnicodeResult'>. Why? I thought it would be a UTF-8 string.

caron asked Oct 21 '25 11:10


2 Answers

lxml.etree._ElementUnicodeResult is a class that inherits from unicode:

$ pydoc lxml.etree._ElementUnicodeResult

lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
 |  Method resolution order:
 |      _ElementUnicodeResult
 |      __builtin__.unicode
 |      __builtin__.basestring
 |      __builtin__.object

In Python, it's fairly common for classes to extend built-in types in order to add some module-specific functionality. It should be safe to treat the object like a regular Unicode string.
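As a rough illustration of why such a subclass behaves like an ordinary string, here is a minimal sketch. It is written for Python 3, where the unicode type is called str, and TaggedText with its parent_tag attribute is a hypothetical stand-in, not lxml's actual class.

```python
class TaggedText(str):
    """Hypothetical str subclass that carries one extra attribute."""

    def __new__(cls, value, parent_tag=None):
        # str is immutable, so the value is set in __new__, not __init__.
        obj = super().__new__(cls, value)
        obj.parent_tag = parent_tag
        return obj


text = TaggedText("caf\u00e9 menu", parent_tag="p")

# The subclass passes isinstance checks and supports every str method.
is_string = isinstance(text, str)
upper = text.upper()            # string methods return a plain str
encoded = text.encode("utf-8")  # explicit serialization to bytes
```

The extra attribute rides along on the object, while anything that expects a string can use it unchanged.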

You might want to skip the re-encoding step: lxml.html automatically uses the encoding declared in the source document, and as long as the result is valid Unicode, there is probably no reason to care how the page was originally encoded.

Unless your project is so small and informal that you can be sure you will never encounter 8-bit strings (i.e. it's always 7-bit ASCII, English with no special characters), it's wise to get your text into Unicode as early as possible (ideally right after retrieval) and keep it that way until you need to serialize it for writing to a file or sending over a socket.
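The "decode early, encode late" pattern described above could look roughly like this. It is a sketch in Python 3 using only the standard library; fetch_bytes is a hypothetical stand-in for the network read, and the Latin-1 encoding is assumed to be known rather than detected.

```python
import io


def fetch_bytes():
    """Hypothetical stand-in for a network read; returns Latin-1 bytes."""
    return "K\u00f6ln caf\u00e9".encode("latin-1")


# 1. Decode once, right after retrieval (the encoding is assumed known here).
text = fetch_bytes().decode("latin-1")

# 2. Work on text only: string methods operate on characters, not bytes.
words = text.split()
title = text.title()

# 3. Encode once, at the output boundary.
sink = io.BytesIO()
sink.write(title.encode("utf-8"))
output = sink.getvalue()
```

Everything between steps 1 and 3 is independent of how the page was originally encoded.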

The reason you're seeing <type 'lxml.etree._ElementUnicodeResult'> is that lxml.html.fromstring() automatically does the decode step for you. Note that this means your code above will not work for a page encoded in UTF-16, for example: the 8-bit string will be re-encoded as UTF-8, but the HTML will still declare UTF-16:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-16" />

lxml will then try to decode the string using UTF-16 rules, which I would expect to raise an exception in short order.
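The mismatch can be reproduced with Python's own codecs. This is only a sketch in Python 3: the page string and the decode_as_declared helper are hypothetical, and lxml's parser may react differently from a plain decode call.

```python
# Hypothetical page that declares UTF-16 and really is UTF-16 on the wire.
page = '<meta charset="UTF-16"><p>h\u00e9llo</p>'
wire_bytes = page.encode("utf-16")

# Re-encoding the bytes to UTF-8 leaves the declaration inside the markup
# untouched, so the bytes and the declared charset no longer agree.
reencoded = wire_bytes.decode("utf-16").encode("utf-8")


def decode_as_declared(data, declared):
    """Decode with the declared charset; return None if decoding fails."""
    try:
        return data.decode(declared)
    except UnicodeDecodeError:
        return None


mismatched = decode_as_declared(reencoded, "utf-16")   # garbage or None
consistent = decode_as_declared(wire_bytes, "utf-16")  # the original text
```

Only the untouched bytes round-trip to the original page; the re-encoded ones either fail to decode or come back garbled.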

If you want the output serialized as a UTF-8 encoded 8-bit string, all you need is this:

>>> text = doc.text_content().encode('utf-8')
>>> print type(text)
<type 'str'>
scanny answered Oct 24 '25 08:10