Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on the page.
The original plan was to use a BeautifulSoup findAll(True).getText() value. EDIT: don't use this for html work, use the lxml library, it's python based and much faster than BeautifulSoup. command (which means extract all html tags)
But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph dividers for example.
Does anyone have any experience with this? Any help with something like this would be amazing.
At the moment I'm using BeautifulSoup along with python, but willing to explore other possibilities.
Here are some deadly helpful python libraries for the task in sorted order of how much it helped me:
#1 goose library Fast, powerful, consistent #2 readability library Content is passable, slower on average than goose but faster than boilerpipe #3 python-boilerpipe Slower & hard to install, no fault to the boilerpipe library (originally in java), but to the fact that this library is build on top of another library in java, which attributes to IO time & errors, etc.
I'll release benchmarks perhaps if there is interest.
Indirectly related libraries, you should probably install them and read their docs:
A lot of the value and power in using python, a rather slow language, comes from it's open source libraries. They are especially awesome when combined and used together, and everyone should take advantage of them to solve whatever problems they may have!
Goose library gets lots of solid maintenance, they just added Arabic support, it's great!
Finding the text BeautifulSoup provides a simple way to find text content (i.e. non-HTML) from the HTML: text = soup.find_all (text=True) However, this is going to give us some information we don't want.
We'll use Beautiful Soup to parse the HTML as follows: from bs4 import BeautifulSoup soup = BeautifulSoup(html_page, 'html.parser') Finding the text. BeautifulSoup provides a simple way to find text content (i.e. non-HTML) from the HTML: text = soup.find_all(text=True) However, this is going to give us some information we don't want.
The BeautifulSoup object represents the parsed document as a whole. In this article, we’ll be scrapping a simple website and replacing the content in the parsed “soup” variable.
Now let’s parse the content as a BeautifulSoup object to extract the title and header tags of the website (as for this article) and to replace it in the original soup variable. The find () method returns the first matching case from the soup object. Replacing the content of the parsed soup obj with the “.string” method.
You might look at the python-readability package which does exactly this for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With