I have some HTML code like this:
<p>aaa</p>bbb
<p>ccc</p>ddd
How can I get 'bbb' and 'ddd'?
You can read the next sibling of each p tag (note that this is specific to this markup, so you may need to adapt it to your situation):
In [1]: from bs4 import BeautifulSoup
In [2]: html = """\
...: <p>aaa</p>bbb
...: <p>ccc</p>ddd"""
In [3]: soup = BeautifulSoup(html, 'html.parser')
In [4]: [p.next_sibling for p in soup.find_all('p')]
Out[4]: [u'bbb\n', u'ddd']
This picks up the trailing newline, so you can strip it off if need be:
In [5]: [p.next_sibling.strip() for p in soup.find_all('p')]
Out[5]: [u'bbb', u'ddd']
The general idea is that you locate the tag(s) immediately before your target text and then take the next sibling node of each one, which should be your text.
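If your real markup is less regular, the list comprehension can break: next_sibling is None when nothing follows a tag, and it is another tag when two tags sit back to back, so calling .strip() on it fails. A minimal sketch of a more defensive version follows. The function name text_after_tags, its tag_name parameter, and the extra sample markup are made up for illustration, and the 'html.parser' backend is an assumption; any installed parser should behave the same for this input.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def text_after_tags(html, tag_name="p"):
    """Return the stripped text that directly follows each `tag_name` tag.

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all(tag_name):
        sibling = tag.next_sibling
        # Skip the cases where there is no text node to read:
        # None (nothing follows), another tag, or an HTML comment.
        if not isinstance(sibling, NavigableString) or isinstance(sibling, Comment):
            continue
        text = sibling.strip()
        if text:  # ignore whitespace-only text between tags
            results.append(text)
    return results


# Sample input extended with two adjacent <p> tags that have no trailing text.
html = "<p>aaa</p>bbb\n<p>ccc</p>ddd\n<p>eee</p><p>fff</p>"
print(text_after_tags(html))
```

With this input, the last two p tags are skipped because one is followed by another tag and the other by nothing, so only 'bbb' and 'ddd' come back.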