
Why does b' (and sometimes b' ') show up when I split some HTML source? [Python]

I'm fairly new to Python and programming in general. I have done a few tutorials and am about 2/3 through a pretty good book. That said, I've been trying to get more comfortable with Python and programming by just trying out things in the std lib.

I have recently run into a weird quirk that I'm sure is the result of my own incorrect or un-"pythonic" use of the urllib module (with Python 3.2.2):

import urllib.request

HTML_source = urllib.request.urlopen('http://www.somelink.com').read()

print(HTML_source)

When this bit is run through the active interpreter it returns the HTML source of somelink, however it prefixes it with b', for example:

b'<HTML>\r\n<HEAD> (etc). . . .

If I split the string into a list by whitespace, it prefixes every item with the b'.

I'm not really trying to accomplish something specific, just trying to familiarize myself with the std lib. I would like to know why this b' is getting prefixed.

Also, bonus: is there a better way to get the HTML source WITHOUT using a third-party module? I know all that jazz about not reinventing the wheel and whatnot, but I'm trying to learn by "building my own tools".

Thanks in Advance!

Oliver asked Feb 02 '26 15:02


2 Answers

The "b" prefix means that the type is bytes not str. To convert the bytes into text, use the decode method and name the appropriate encoding. The encoding is often found in the "Content-Type" header:

>>> import urllib.request
>>> u = urllib.request.urlopen('http://cnn.com')
>>> u.getheader('Content-Type')
'text/html; charset=UTF-8'
>>> html = u.read().decode('utf-8')
>>> type(html)
<class 'str'>

If you don't find the encoding in the headers, try utf-8 as a default.
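The header lookup and the utf-8 fallback described above can be folded into one small helper. This is only a sketch: the name decode_body and its parameters are made up for illustration, and it parses the header with the standard library's email.message.Message, which is the same machinery that backs the response's headers object.

```python
from email.message import Message


def decode_body(raw, content_type=None, default="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header.

    decode_body is a hypothetical helper, not part of urllib.
    Falls back to `default` when the header is missing or has no charset.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None.
    charset = msg.get_content_charset() or default
    # errors="replace" keeps going if a few bytes do not fit the charset.
    return raw.decode(charset, errors="replace")


html = decode_body(b"<HTML>\r\n<HEAD>", "text/html; charset=UTF-8")
print(type(html))  # <class 'str'>
```

With a real response u, the call would look like decode_body(u.read(), u.getheader('Content-Type')).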

Raymond Hettinger answered Feb 05 '26 05:02


b'' is a bytes literal. There are no b'' objects in memory, only bytes; b'' is just the notation for bytes objects in source code (and in their repr). Plain quotes '' in source code create str objects (Unicode strings).

If a bytes object represents text (rather than binary data such as an image), then in general you should decode it to a Unicode string as soon as possible. To do that, you need to know the character encoding of the text.

HTML parsers such as lxml.html and BeautifulSoup may convert bytes to Unicode without your intervention.
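A quick way to see that the prefix is only notation: splitting a bytes object gives a list of bytes objects, and each one is displayed with its own b'...'. The sample data below is invented for illustration, and decoding it as ASCII is an assumption that happens to hold for this snippet.

```python
data = b"<HTML>\r\n<HEAD> <TITLE>"

# Splitting bytes yields bytes; the b prefix appears in each item's repr.
parts = data.split()
print(parts)

# Decode first, and the same split yields ordinary str objects instead.
text_parts = data.decode("ascii").split()
print(text_parts)
```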

If you don't know the encoding, it might be non-trivial to detect it; e.g., read how feedparser detects character encoding [2006].
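For a rough idea of what a fallback can look like, here is a deliberately simple sketch. It is not feedparser's algorithm: it only looks for a charset declared in a meta tag near the top of the page, then tries a short list of candidate encodings in order. The name guess_and_decode, the regular expression, and the fallback list are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def guess_and_decode(raw, fallbacks=("utf-8", "latin-1")):
    """Return (text, encoding_used) for raw bytes of unknown encoding."""
    candidates = []
    match = META_CHARSET.search(raw[:2048])  # only scan the start of the page
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next candidate.
            continue
    # Reached only if every candidate failed (latin-1 itself never fails).
    return raw.decode("utf-8", errors="replace"), "utf-8"


text, used = guess_and_decode(b"caf\xe9")
print(used)
```

Note that latin-1 accepts any byte sequence, so it belongs last in the list: it always "succeeds", even when the result is wrong.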

jfs answered Feb 05 '26 05:02


