I am completely new to web parsing with Python/BeautifulSoup. I have an HTML page that contains (in part) the following markup:
<div id="pages">
<ul>
<li class="active"><a href="example.com">Example</a></li>
<li><a href="example.com">Example</a></li>
<li><a href="example1.com">Example 1</a></li>
<li><a href="example2.com">Example 2</a></li>
</ul>
</div>
I have to visit each link (basically each <li> element) until there are no more <li> tags left. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
This code gives me the first <li> item in the list. My plan is to keep checking whether next_sibling is not None. If it is not None, I make an HTTP request to the href attribute of the <a> tag in that sibling <li>. That gets me to the next page, and so on, until there are no more pages.
But I can't figure out how to get the next sibling of the page variable above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation but couldn't find it. Can someone help, please?
Use find_next_sibling() and be explicit about which sibling element you want to find:
next_li_element = page.find_next_sibling("li")
next_li_element will be None if page is the last li in the list:
if next_li_element is None:
    # no more pages to go
    pass
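The check above can be turned into a loop. The following is a minimal sketch, not the asker's actual crawler: it parses the sample markup from the question as an inline string instead of fetching a page, and the function name remaining_page_links is hypothetical. It assumes Python 3 with bs4 installed and uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (no network access needed).
HTML = """
<div id="pages">
<ul>
<li class="active"><a href="example.com">Example</a></li>
<li><a href="example.com">Example</a></li>
<li><a href="example1.com">Example 1</a></li>
<li><a href="example2.com">Example 2</a></li>
</ul>
</div>
"""


def remaining_page_links(html):
    """Return the href of every <li> that follows the active one.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    page_list = soup.find("div", {"id": "pages"})
    page = page_list.find("li", {"class": "active"})

    links = []
    next_li = page.find_next_sibling("li")
    while next_li is not None:
        anchor = next_li.find("a")
        if anchor is not None and anchor.get("href"):
            links.append(anchor["href"])
        # Move on; find_next_sibling returns None after the last <li>.
        next_li = next_li.find_next_sibling("li")
    return links


links = remaining_page_links(HTML)
print(links)  # ['example.com', 'example1.com', 'example2.com']
```

In a real crawl, each collected href would be requested in turn and the new page parsed again, since the question says the 'active' class moves to the link that was just visited.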
Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()
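As to why page.next_sibling is awkward here: with markup like the question's, the node right after the active <li> is typically the newline between the tags, not the next <li>. The sketch below illustrates this on a shortened copy of the sample markup, assuming Python 3, bs4, and the built-in "html.parser"; the variable names raw and next_href are only for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened version of the question's markup, with newlines between the tags.
html = (
    '<ul>\n'
    '<li class="active"><a href="example.com">Example</a></li>\n'
    '<li><a href="example1.com">Example 1</a></li>\n'
    '</ul>'
)

soup = BeautifulSoup(html, "html.parser")
page = soup.find("li", {"class": "active"})

# .next_sibling returns the very next node, here a whitespace text node.
raw = page.next_sibling
print(isinstance(raw, NavigableString))  # True

# find_next_sibling() skips text nodes and returns the next tag.
sibling = page.find_next_sibling()
print(isinstance(sibling, Tag))  # True

# The href lives on the <a> inside the <li>, not on the <li> itself.
next_href = sibling.a["href"]
print(next_href)  # example1.com
```

Passing a tag name, as in find_next_sibling("li"), makes the intent explicit if other element types could appear among the siblings.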