The following code is used to grab continuous segments of text from within html.
for text in soup.find_all_next(text=True):
    if isinstance(text, Comment):
        # We found a comment, ignore
        continue
    if not text.strip():
        # We found a blank text, ignore
        continue
    # Whatever is left must be good
    print(text)
Text items are broken up by structure tags like <div> or <br>, but also by formatting tags like <em> and <strong>. This causes me some inconvenience in further parsing of the text, and I would like to be able to grab continuous text items while ignoring any formatting tags interior to the text.
For example, I would like soup.find_all_next(text=True) to take the HTML <div>This is <em>important</em> text</div> and return a single string, "This is important text", instead of three strings: "This is ", "important", and " text".
I'm not sure if that's clear... Let me know if it's not.
EDIT: The reason I'm walking through the html text item by text item is that I'm only beginning the walk after I see a specific "begin" comment tag and I'm stopping when I reach a specific "end" comment tag. Are there any solutions that work within this context of needing to walk item by item? The full code I'm using is below.
soup = BeautifulSoup(page)
for instanceBegin in soup.find_all(text=isBeginText):
    # We found a start comment, look at all text and comments:
    for text in instanceBegin.find_all_next(text=True):
        # We found a text or comment, examine it closely
        if isEndText(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)
Where the two functions isBeginText(text) and isEndText(text) return true if the string passed to them matches my starting or ending comment tags.
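For completeness, the two helpers might look roughly like the sketch below. The marker strings "begin" and "end" are placeholders I made up for illustration; my real comment text is different, and the sample page is hypothetical too.

```python
from bs4 import BeautifulSoup, Comment

# Placeholder markers (assumption): the real comment text is not shown here.
BEGIN_MARKER = "begin"
END_MARKER = "end"

def isBeginText(text):
    """Return True if text is a comment holding the begin marker."""
    return isinstance(text, Comment) and text.strip() == BEGIN_MARKER

def isEndText(text):
    """Return True if text is a comment holding the end marker."""
    return isinstance(text, Comment) and text.strip() == END_MARKER

# Hypothetical page used only to exercise the helpers.
page = "<!-- begin --><p>kept</p><!-- end --><p>dropped</p>"
soup = BeautifulSoup(page, "html.parser")
begin = soup.find(string=isBeginText)
end = soup.find(string=isEndText)
print(repr(begin), repr(end))
```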
If you grab the parent element that contains the child elements and call get_text() on it, BeautifulSoup will strip out all the HTML tags for you and return a single continuous string of text.
For example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
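get_text() needs a single enclosing element, which the begin/end comment walk in the question does not have. One possible adaptation, offered only as a sketch, is to keep walking node by node but join consecutive strings that share the same nearest non-formatting ancestor. The INLINE_TAGS set, the helper names block_parent and text_runs, the "begin"/"end" markers, and the sample page are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Assumption: tags treated as pure formatting; extend the set as needed.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong", "u"}

def block_parent(node):
    """Return the nearest ancestor that is not an inline formatting tag."""
    parent = node.parent
    while parent is not None and parent.name in INLINE_TAGS:
        parent = parent.parent
    return parent

def text_runs(begin, is_end):
    """Yield continuous text runs after `begin`, stopping at the end comment."""
    pieces, current = [], None
    for node in begin.next_elements:
        if isinstance(node, Comment):
            if is_end(node):
                break
            continue  # ignore any other comment
        if isinstance(node, Tag):
            if node.name in INLINE_TAGS:
                continue  # formatting tag: does not interrupt the run
            block = None  # structural tag such as <div> or <br>: force a break
        else:
            block = block_parent(node)
        if block is None or block is not current:
            # Boundary reached: emit whatever has been collected so far.
            run = "".join(pieces).strip()
            pieces = []
            if run:
                yield run
        if block is not None:
            pieces.append(str(node))
            current = block
    run = "".join(pieces).strip()
    if run:
        yield run

page = """<p>skip me</p>
<!-- begin -->
<div>This is <em>important</em> text</div>
<p>Second<br>third <strong>bold</strong></p>
<!-- end -->
<p>skip me too</p>"""

soup = BeautifulSoup(page, "html.parser")
begin = soup.find(string=lambda s: isinstance(s, Comment) and s.strip() == "begin")
runs = list(text_runs(begin, lambda c: c.strip() == "end"))
print(runs)
```

Grouping by ancestor rather than by tag name alone means text on either side of a closing </div> is not merged by accident, while <br> still splits a paragraph into separate runs.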