Because I hate clicking back and forth while reading through Wikipedia articles, I am trying to build a tool to create "expanded Wikipedia articles" according to the following algorithm:
The algorithm takes two parameters, Depth and Length: for each link in the article, take the first Length sentences of the linked article and include them in the original article (e.g. in brackets or otherwise highlighted), recursing only up to Depth, i.e. not deeper than two levels. The result would be an article that could be read in one go without always clicking to and fro...
How would you build such a mechanism in Python? Which libraries should be used (are there any for such tasks)? Are there any helpful tutorials?
You can use urllib2 for requesting the URL. For parsing the HTML page there is a wonderful library called BeautifulSoup. One thing you need to consider is that while scanning Wikipedia with your crawler you need to send a User-agent header along with your request; otherwise Wikipedia will simply refuse to be crawled.
import urllib2
from bs4 import BeautifulSoup

# page is the URL of the article to fetch
request = urllib2.Request(page)
# add a User-agent header, or Wikipedia will block the default urllib2 client
request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')
Then load the page and give it to BeautifulSoup:
response = urllib2.urlopen(request)
soup = BeautifulSoup(response, 'html.parser')
text = soup.get_text()
This will give you the external links on a page:

import re

for url in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    link = url['href']
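Note that for the expansion algorithm in your question, the interesting links are Wikipedia's internal article links, which are relative paths rather than full http:// URLs. A sketch of matching those instead (the en.wikipedia.org prefix is an assumption for the English Wikipedia):

# internal article links are relative, e.g. /wiki/Depth-limited_search
for url in soup.find_all('a', attrs={'href': re.compile("^/wiki/")}):
    link = 'http://en.wikipedia.org' + url['href']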
And now, regarding the algorithm for crawling Wikipedia: what you want is called depth-limited search. The Wikipedia article on depth-limited search provides pseudocode that is easy to follow.
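Here is a minimal sketch of how the depth-limited expansion from your question could look, combining the pieces above. The helper names (fetch_soup, first_sentences, expand) and the crude sentence splitting are assumptions for illustration, not part of urllib2 or BeautifulSoup:

import re
import urllib2
from bs4 import BeautifulSoup

BASE = 'http://en.wikipedia.org'

def fetch_soup(url):
    # request the page with a User-agent header, as shown above
    request = urllib2.Request(url)
    request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')
    return BeautifulSoup(urllib2.urlopen(request), 'html.parser')

def first_sentences(soup, length):
    # crude sentence split on the first paragraph; fine for a sketch
    paragraph = soup.find('p')
    if paragraph is None:
        return ''
    sentences = paragraph.get_text().split('. ')
    return '. '.join(sentences[:length])

def expand(url, depth, length):
    # depth-limited search: stop recursing when depth runs out
    soup = fetch_soup(url)
    summary = first_sentences(soup, length)
    if depth == 0:
        return summary
    parts = [summary]
    # follow only internal article links; a real crawler would also
    # cap how many links it expands per page
    for a in soup.find_all('a', attrs={'href': re.compile('^/wiki/')}):
        parts.append('[' + expand(BASE + a['href'], depth - 1, length) + ']')
    return ' '.join(parts)

Called as expand('http://en.wikipedia.org/wiki/Depth-limited_search', 2, 3), this would inline the first three sentences of every linked article, two levels deep, in brackets, as your question describes.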
Other functionality of these libraries is covered in their documentation and is easy to follow. Good luck.