Unable to extract text in the immediate level using BeautifulSoup

Question

I followed this method to extract the text at the immediate level of a tag, using find(text=True, recursive=False) as mentioned in another answer, but for some markup, such as u'<p> <strong> Established </strong> 1865 </p> ', it doesn't work:

Here's the code:

from bs4 import BeautifulSoup

markup = u'<p>\n <strong>\n  Established\n </strong>\n 1865\n</p>\n'
s = BeautifulSoup(markup, 'lxml')
print(s.find('p').find(text=True, recursive=False))

And the result is just a newline:

u'\n'

It works if I strip all the newlines from the markup, but I don't think it's a good idea to strip every newline from the whole HTML file.

Is there any other solution?

Mikhail M. · Accepted Answer

find returns only the first match, which here is the whitespace text node that sits between <p> and <strong>. You need to use find_all:

print(s.find('p').find_all(text=True, recursive=False))

['\n', '\n 1865\n']

Process the result as needed. For example, strip each piece and join the pieces into the final text:

data = s.find('p').find_all(text=True, recursive=False)
text = ' '.join(i.strip() for i in data)
print(text)
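
Note that the join above keeps an empty string for each whitespace-only node, so the result can start with a stray space. A minimal self-contained sketch that also drops those empty pieces might look like the following. It makes two assumptions that are not in the question: it uses the built-in 'html.parser' instead of 'lxml' so that no extra parser is needed, and it uses string=True, which recent Beautiful Soup versions accept as the newer spelling of text=True.

```python
from bs4 import BeautifulSoup

markup = u'<p>\n <strong>\n  Established\n </strong>\n 1865\n</p>\n'
# Assumption: 'html.parser' is used here only to avoid depending on lxml.
soup = BeautifulSoup(markup, 'html.parser')
p = soup.find('p')

# find() stops at the first direct text node, which is whitespace here.
first = p.find(string=True, recursive=False)

# find_all() returns every direct text node of <p>; the text inside
# <strong> is skipped because recursive=False.
nodes = p.find_all(string=True, recursive=False)

# Strip each node and drop the ones that were whitespace only.
pieces = [node.strip() for node in nodes if node.strip()]
text = ' '.join(pieces)
print(repr(text))
```

Filtering before joining means the newlines in the source HTML never need to be removed; only the extracted text nodes are cleaned.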

Tags:

python

beautifulsoup

Devi Prasad Khatua

1 Answer

Mikhail M.
