I followed this method to extract the text at the immediate level of a tag, using find(text=True, recursive=False) as mentioned in another answer, but for some markup, such as u'<p>\n <strong>\n Established\n </strong>\n 1865\n</p>\n', it does not work:
Here's the code:
from bs4 import BeautifulSoup

markup = u'<p>\n <strong>\n Established\n </strong>\n 1865\n</p>\n'
s = BeautifulSoup(markup, 'lxml')
print(s.find('p').find(text=True, recursive=False))
And it prints
u'\n'
It works if I strip all the newlines (\n) from the markup, but I don't think it's a good idea to blindly strip every newline from the whole HTML file.
Is there another solution?
find returns only the first match, which here is the whitespace before the <strong> tag. To get every direct text node, use find_all:
print(s.find('p').find_all(text=True, recursive=False))
['\n', '\n 1865\n']
Process the result as needed. For example, strip each piece and join the pieces into the final text:
data = s.find('p').find_all(text=True, recursive=False)
# skip whitespace-only pieces so the result has no stray leading space
text = ' '.join(i.strip() for i in data if i.strip())
print(text)
1865
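The same idea can be wrapped in a small helper so it can be reused for any tag. This is only a sketch: the name direct_text is made up for illustration, it uses the built-in 'html.parser' instead of 'lxml' so it runs without an extra dependency, and it passes string=True, which newer Beautiful Soup releases accept in place of text=True.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the text sitting directly inside `tag`, ignoring child tags.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    # Only the tag's immediate text nodes, not text nested in child tags.
    pieces = tag.find_all(string=True, recursive=False)
    # Strip each piece and drop the ones that were whitespace only.
    cleaned = (piece.strip() for piece in pieces)
    return ' '.join(piece for piece in cleaned if piece)


markup = u'<p>\n <strong>\n Established\n </strong>\n 1865\n</p>\n'
soup = BeautifulSoup(markup, 'html.parser')
result = direct_text(soup.find('p'))
print(result)  # 1865
```

Dropping the whitespace-only pieces before joining means the markup itself never has to be stripped of newlines.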