I am using Beautiful Soup to extract the text from this example HTML:
....
<div style="s1">
<div style="s2">Here is text 1</div>
<div style="s3">Here is text 2</div>
Here is text 3 and this is what I want.
</div>
....
Text 1 and text 2 are both at the nested level 2, while text 3 sits directly inside the outer level-1 div. I only want text 3, and tried this:
for anchor in tbody.findAll('div', style="s1"):
    review = anchor.text
    print(review)
But this code gives me all of text 1, 2, and 3. How do I get only the first-level text 3?
Something like:
import bs4
for anchor in tbody.findAll('div', style="s1"):
    text = ''.join([x for x in anchor.contents if isinstance(x, bs4.element.NavigableString)])
works. Be aware that the result also includes the surrounding line breaks, so you may need to call .strip() on it.
For example:
for anchor in tbody.findAll('div', style="s1"):
    text = ''.join([x for x in anchor.contents if isinstance(x, bs4.element.NavigableString)])
    print([text])
    print([text.strip()])
This prints:
[u'\n\n\nHere is text 3 and this is what I want.\n']
[u'Here is text 3 and this is what I want.']
(I put them in lists so you could see the newlines.)
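For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It assumes the bs4 package is installed and uses the built-in "html.parser" backend. The helper name direct_text and the HTML constant are made up for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup mirroring the structure in the question.
HTML = """
<div style="s1">
<div style="s2">Here is text 1</div>
<div style="s3">Here is text 2</div>
Here is text 3 and this is what I want.
</div>
"""


def direct_text(tag):
    """Hypothetical helper: join only the strings that are direct
    children of `tag`, skipping text inside nested tags."""
    return ''.join(
        child for child in tag.contents
        if isinstance(child, NavigableString)
    ).strip()


soup = BeautifulSoup(HTML, "html.parser")

# Collect the first-level text of every <div style="s1">.
results = [direct_text(div) for div in soup.find_all("div", style="s1")]
print(results)
```

Because .contents lists only the immediate children of a tag, the text inside the nested s2 and s3 divs never appears as a NavigableString at this level, so only text 3 is left after the filter.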