How does one get the text from html while ignoring formatting tags using BeautifulSoup?

Question

The following code is used to grab continuous segments of text from within html.

    for text in soup.find_all_next(text=True):
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

Text items are split at structural tags like <div> and <br>, but also at formatting tags like <em> and <strong>. This makes further parsing of the text inconvenient, so I would like to grab continuous text items while ignoring any formatting tags inside the text.

For example, I would like soup.find_all_next(text=True) to take the html code <div>This is <em>important</em> text</div> and return a single string, "This is important text", instead of three strings: "This is ", "important", and " text".
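A minimal reproduction of the splitting, as a sketch: the `html` variable is made up for illustration, and `string=True` is used here as the newer spelling of `text=True`.

```python
from bs4 import BeautifulSoup

# Hypothetical input, taken from the example sentence above.
html = "<div>This is <em>important</em> text</div>"
soup = BeautifulSoup(html, "html.parser")

# Every text node is returned on its own, so the <em> tag splits the sentence.
pieces = [str(s) for s in soup.find_all(string=True)]
print(pieces)  # ['This is ', 'important', ' text']
```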

I'm not sure if that's clear... Let me know if it's not.

EDIT: The reason I'm walking through the html text item by text item is that I'm only beginning the walk after I see a specific "begin" comment tag and I'm stopping when I reach a specific "end" comment tag. Are there any solutions that work within this context of needing to walk item by item? The full code I'm using is below.

    from bs4 import BeautifulSoup, Comment

    soup = BeautifulSoup(page)
    for instanceBegin in soup.find_all(text=isBeginText):
        # We found a start comment, look at all text and comments:
        for text in instanceBegin.find_all_next(text=True):
            # We found a text or comment, examine it closely
            if isEndText(text):
                # We found the end comment, everybody out of the pool
                break
            if isinstance(text, Comment):
                # We found a comment, ignore
                continue
            if not text.strip():
                # We found a blank text, ignore
                continue
            # Whatever is left must be good
            print(text)

The two functions isBeginText(text) and isEndText(text) return True if the string passed to them matches my starting or ending comment tag, respectively.

chafreaky · Accepted Answer

If you grab the parent element that contains the child elements and call get_text() on it, BeautifulSoup strips out all the html tags and returns a single continuous string of text.
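As a small illustration on the snippet from the question (the `html` and `div` names are assumptions made for this sketch):

```python
from bs4 import BeautifulSoup

# Hypothetical input: the <div> example from the question.
html = "<div>This is <em>important</em> text</div>"
soup = BeautifulSoup(html, "html.parser")

# get_text() on the parent joins the text of all descendants,
# so the <em> tag no longer splits the sentence.
div = soup.find("div")
text = div.get_text()
print(text)  # This is important text
```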

For example, on a whole document:

    from bs4 import BeautifulSoup

    # html_doc is a string holding the HTML to parse
    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.get_text())
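For the case in the question's edit, where the walk has to start and stop at specific comments, get_text() on a parent may pull in text from outside that range. One possible workaround, sketched below, keeps the item-by-item walk but merges adjacent strings that are separated only by formatting tags. This is not part of the original answer: the `INLINE_TAGS` set, the `block_parent` and `text_runs` helpers, and the sample markup are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Assumption: which tags count as "formatting" is a choice made for this sketch.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong", "u"}


def block_parent(node):
    """Return the nearest ancestor that is not an inline formatting tag."""
    parent = node.parent
    while parent is not None and parent.name in INLINE_TAGS:
        parent = parent.parent
    return parent


def text_runs(begin, is_end):
    """Collect merged text runs after `begin` until `is_end(comment)` is true."""
    runs, buffer, current = [], [], None

    def flush():
        text = "".join(buffer).strip()
        if text:
            runs.append(text)
        buffer.clear()

    for node in begin.next_elements:
        if isinstance(node, Comment):  # checked first: Comment is a string type
            if is_end(node):
                break
        elif isinstance(node, Tag):
            if node.name not in INLINE_TAGS:
                flush()  # a structural tag such as <div> or <br> ends the run
        elif isinstance(node, NavigableString):
            owner = block_parent(node)
            if owner is not current:
                flush()  # the text moved to a different structural parent
                current = owner
            buffer.append(str(node))
    flush()
    return runs


# Hypothetical page with "begin" and "end" marker comments.
html = """
<body>
<!-- begin -->
<div>This is <em>important</em> text</div>
<p>Second <strong>bold</strong> line<br>after break</p>
<!-- end -->
<p>ignored</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
begin = soup.find(string=lambda s: isinstance(s, Comment) and s.strip() == "begin")
runs = text_runs(begin, lambda c: c.strip() == "end")
print(runs)  # ['This is important text', 'Second bold line', 'after break']
```

Strings are grouped by their nearest non-formatting ancestor, so <em> and <strong> do not break a run, while <div>, <p> and <br> still do. Extending `INLINE_TAGS` changes which tags are ignored.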

Tags: python, html, python-3.x, beautifulsoup

wrkyle
