Parsing Wikipedia recursively and fetching text from included links

Because I hate clicking back and forth while reading through Wikipedia articles, I am trying to build a tool that creates "expanded Wikipedia articles" according to the following algorithm:

  • Create two variables: Depth and Length.
  • Set a Wikipedia article as the seed page.
  • Parse through this article: whenever there is a link to another article, fetch the first Length sentences and include them in the original article (e.g. in brackets or otherwise highlighted).
  • Do this recursively up to a certain Depth, i.e. no deeper than two levels.

The result would be an article that could be read in one go without always clicking to and fro...

How would you build such a mechanism in Python? Which libraries should be used (are there any for such tasks)? Are there any helpful tutorials?

asked by vonjd
1 Answer

You can use urllib2 for requesting the URL. For parsing the HTML page there is a wonderful library called BeautifulSoup. One thing you need to consider is that while scanning Wikipedia with your crawler you need to add a header along with your request, or else Wikipedia will simply disallow being crawled.

 import urllib2

 # page is the URL of the Wikipedia article to fetch
 request = urllib2.Request(page)

Adding the header:

 request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')

Then load the page and give it to BeautifulSoup:

 from bs4 import BeautifulSoup

 response = urllib2.urlopen(request)
 soup = BeautifulSoup(response, 'html.parser')
 text = soup.get_text()

This will give you the links in a page:

 import re

 for url in soup.find_all('a', attrs={'href': re.compile("^http://")}):
     link = url['href']
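
Note that links to other Wikipedia articles are relative paths starting with /wiki/, so matching only absolute http:// URLs will mostly pick up external links. If you only want article links, a small adaptation of the same find_all call could look like this (just a sketch; it does not filter out special pages such as File: or Help:):

 # Match internal article links instead of absolute external URLs
 for url in soup.find_all('a', attrs={'href': re.compile("^/wiki/")}):
     link = "http://en.wikipedia.org" + url['href']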

And now regarding the algorithm for crawling Wikipedia: what you want is something called depth-limited search. Pseudocode is provided on the Wikipedia page for depth-limited search and is easy to follow.
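
To show how the snippets above might fit the Depth/Length scheme from the question, here is a rough, depth-limited sketch (Python 2, since urllib2 only exists there). The helper names fetch_soup, first_sentences and expand are made up for illustration, the sentence splitting is deliberately naive, the seed article is also only summarized rather than kept in full, and there is no caching of already-visited articles or filtering of special pages:

 import re
 import urllib2
 from bs4 import BeautifulSoup

 BASE = "http://en.wikipedia.org"

 def fetch_soup(url):
     # Request the page with a User-agent header, as described above.
     request = urllib2.Request(url)
     request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')
     return BeautifulSoup(urllib2.urlopen(request), 'html.parser')

 def first_sentences(soup, length):
     # Crude "first Length sentences": split the first paragraph on '. '.
     paragraph = soup.find('p')
     if paragraph is None:
         return ''
     return '. '.join(paragraph.get_text().split('. ')[:length])

 def expand(url, depth, length):
     # Depth-limited expansion: summarize this article and, while depth
     # allows, append a bracketed summary for each article it links to.
     soup = fetch_soup(url)
     text = first_sentences(soup, length)
     if depth <= 0:
         return text
     extras = []
     for a in soup.find_all('a', attrs={'href': re.compile("^/wiki/")}):
         extras.append('[' + a.get_text() + ': '
                       + expand(BASE + a['href'], depth - 1, length) + ']')
     return text + ' ' + ' '.join(extras)

For example, print expand(BASE + '/wiki/Depth-limited_search', 1, 3) gives the seed article's summary followed by a bracketed three-sentence summary of every article it links to.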

Other functionality of these libraries is easy to look up in their documentation. Good luck.

answered by Emil
