I've got a document like this:
<p class="top">I don't want this</p>
<p>I want this</p>
<table>
   <!-- ... -->
</table>
<img ... />
<p> and all that stuff too</p>
<p class="end">But not this and nothing after it</p>
I want to extract everything between the p[class=top] and p[class=end] paragraphs.
Is there a nice way I can do this with BeautifulSoup?
The node.nextSibling attribute is what you need:
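from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
# Start at the sibling right after <p class="top"> so the marker itself is skipped.
nextNode = soup.find('p', {'class': 'top'}).nextSibling
# Also stop if the siblings run out, in case there is no <p class="end">.
while nextNode is not None:
    if getattr(nextNode, 'name', None) == 'p' and nextNode.get('class', None) == 'end':
        break
    # process nextNode here
    nextNode = nextNode.nextSibling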
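The condition is this complicated because siblings can be string nodes (for example, the whitespace between elements) as well as HTML tags. The getattr check makes sure you only read tag attributes from actual tags.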
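For anyone on the newer bs4 package, here is a rough sketch of the same sibling walk. It assumes bs4 is installed and uses the built-in "html.parser" backend. In bs4 the attribute is spelled next_sibling (with a next_siblings generator), and class comes back as a list rather than a string. The between() helper and the shortened sample HTML are made up for illustration, not part of the library.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample document (illustrative, based on the question's HTML).
html = """
<p class="top">I don't want this</p>
<p>I want this</p>
<table><tr><td>cell</td></tr></table>
<p> and all that stuff too</p>
<p class="end">But not this and nothing after it</p>
<p>Nor this</p>
"""

def between(soup, start_class="top", end_class="end"):
    """Hypothetical helper: collect siblings after p.start_class, up to p.end_class."""
    start = soup.find("p", class_=start_class)
    collected = []
    if start is None:
        return collected
    for node in start.next_siblings:
        # Only Tag objects have attributes; string nodes are skipped by the isinstance check.
        # In bs4, "class" is multi-valued, so it is returned as a list.
        if isinstance(node, Tag) and node.name == "p" and end_class in node.get("class", []):
            break
        collected.append(node)
    return collected

soup = BeautifulSoup(html, "html.parser")
nodes = between(soup)                            # tags and the strings between them
tags = [n for n in nodes if isinstance(n, Tag)]  # tags only
```

Joining str(n) for each n in nodes gives back the markup between the two paragraphs. If you only care about the elements, use the tags list and ignore the whitespace strings.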