How to remove content in nested tags with BeautifulSoup? These posts showed the reverse to retrieve the content in nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?
I have tried .text but it only removes the tags
>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something blah blah something else'
Desired output:
Something something something else
You can check for bs4.element.NavigableString on children:
from bs4 import BeautifulSoup as bs
import bs4
html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
def get_only_text(elem):
for item in elem.children:
if isinstance(item,bs4.element.NavigableString):
yield item
print ''.join(get_only_text(bs(html).find_all('foo')[0]))
Output;
Something something something else
Eg.
body = bs(html)
for tag in body.find_all('bar'):
tag.replace_with('')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With