 

How to remove content in nested tags with BeautifulSoup?

How can I remove the content of nested tags with BeautifulSoup? These posts show the reverse, retrieving the content of nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

I have tried .text, but it only strips the tags and keeps their content:

>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
>>> bs(html, 'html.parser').find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html, 'html.parser').find_all('foo')[0].text
'Something something  blah blah something else'

Desired output:

Something something something else

alvas asked Dec 05 '25 16:12


2 Answers

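You can iterate over the element's children and keep only those that are bs4.element.NavigableString instances, i.e. the text that sits directly inside the tag: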

import bs4
from bs4 import BeautifulSoup as bs

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the direct text children; nested tags such as <bar> are skipped
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))

Output:

Something something  something  else
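
The same filtering can also be sketched with find_all, assuming a Beautiful Soup version that accepts the string= keyword (older releases call it text=). The helper name direct_text and the whitespace collapsing step are additions for illustration and are not part of the answer above.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
soup = BeautifulSoup(html, "html.parser")

def direct_text(tag):
    """Hypothetical helper: return only the text that sits directly inside `tag`."""
    # string=True matches text nodes; recursive=False restricts the search
    # to direct children, so text inside <bar> and <bar2> is never visited.
    parts = tag.find_all(string=True, recursive=False)
    # Collapse the doubled spaces left behind where the nested tags used to be.
    return " ".join("".join(parts).split())

result = direct_text(soup.find("foo"))
print(result)
```

This leaves the parsed tree untouched, which may matter if the nested tags are still needed later.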

Alvaro Fuentes answered Dec 07 '25 08:12


For example, you can replace each nested tag with an empty string and then read the text:

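body = bs(html, 'html.parser')
for tag in body.find_all('bar'):
    # Swap each <bar> tag, including its content, for an empty string
    tag.replace_with('')
print(body.find('foo').text)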
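
A self-contained variant of this approach, as a sketch: it uses Tag.decompose() instead of replace_with(''), which removes the tag from the tree outright, and it collapses the leftover double space to match the desired output in the question. The variable name cleaned and the whitespace handling are assumptions added here, not part of the answer.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <bar> tag together with its content. This modifies the tree in place.
for tag in soup.find_all("bar"):
    tag.decompose()

# get_text() now sees only the remaining text; split/join normalizes
# the double space left where <bar> was removed.
cleaned = " ".join(soup.find("foo").get_text().split())
print(cleaned)
```

Because the tree is modified in place, parse the HTML again if the original nested content is needed afterwards.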

Ricardo Cárdenes answered Dec 07 '25 08:12


