I've seen many ways to initialize a BeautifulSoup object here on SO. As far as I can see, you can pass either a string of HTML or some other object. For instance, it's common to use urllib:
import urllib.request
from bs4 import BeautifulSoup

url = "https://somesite.com"
url_html = "<html><body><h1>Some header</h1><p>asdas</p></body></html>"
soup1 = BeautifulSoup(url_html, "html.parser")  # 1st way: an HTML string
print(soup1.find("p").text)  # can get the text "asdas"
soup2 = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")  # 2nd way: the result of read()
soup3 = BeautifulSoup(urllib.request.urlopen(url), "html.parser")  # 3rd way: the response object itself
print(soup1.prettify())
print(soup2.prettify())
print(soup3.prettify())
But what happens inside the last two ways of initializing the soup? As far as I can see, urllib.request.urlopen(url).read() gives the same thing as a plain HTML string like url_html. But what about soup3?
Does it work because BeautifulSoup's constructor expects a string and the object returned by urlopen() has a toString-style method? Is the object converted into a string, so that in reality the 3rd way is the same as the 2nd?
Are there any other ways of initializing BeautifulSoup? Which is preferable?
urlopen() returns an open file-like object. The BeautifulSoup constructor checks whether it got a file or a string (to be precise, it does hasattr(markup, "read")). In the first case, it simply calls the object's read() method.
This is a common pattern in Python libraries that deal with large amounts of user-provided text data.
In Soup's case there is no practical difference between the two. Other libraries might do something more intelligent with a file object, e.g. partition it and not load it into memory en bloc.
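As a rough illustration of that check, here is a minimal sketch of the same duck-typing pattern. The helper name load_markup is hypothetical and is not part of BeautifulSoup; it only mimics the hasattr test described above, and io.StringIO stands in for the object returned by urlopen() so the snippet runs without a network connection.

```python
import io


def load_markup(markup):
    """Hypothetical helper: return the markup, reading it first if it is file-like."""
    # Duck-typing check: anything with a read() method is treated as a file.
    if hasattr(markup, "read"):
        return markup.read()
    # Otherwise assume we were handed the markup itself (str or bytes).
    return markup


html = "<html><body><p>asdas</p></body></html>"

from_string = load_markup(html)             # like the 1st and 2nd ways
from_file = load_markup(io.StringIO(html))  # like the 3rd way

print(from_string == from_file)
```

So no string conversion of the response object is involved: under this pattern, the 3rd way ends up with whatever read() returns, just as the 2nd way does.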
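To show what "something more intelligent" could look like, here is a minimal sketch of reading a file-like object in pieces instead of all at once. The function iter_chunks and the chunk sizes are assumptions made for illustration; this is not how BeautifulSoup works, since it reads the whole document.

```python
import io


def iter_chunks(fileobj, chunk_size=4096):
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # read() returns an empty value at end of stream
            break
        yield chunk


# io.BytesIO stands in for a large response or file on disk.
stream = io.BytesIO(b"<p>asdas</p>" * 1000)

# Only one chunk is held at a time while the total size is computed.
total = sum(len(chunk) for chunk in iter_chunks(stream, chunk_size=1024))
print(total)
```

A library that accepts a file object can process it this way, which is not possible once the caller has already called read() and passed in the full string.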