Extract text from HTML Tags and plain text (not wrapped in tags)

Question

<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a> 
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a> 
from one's 
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>

I am trying to reconstruct the sentence "to pay charges from one's bank account", which is split across the HTML above. My problem is that one part of the sentence is not wrapped inside HTML tags. When I try to use:

BeautifulSoup.find_all()

I only get the text between the link tags and when I try to use

BeautifulSoup.contents

I only get "from one's" but not the text in between the link tags.

Is there a way to go through this code and reconstruct the sentence?

Edit: The above code is just an example. I am trying to scrape a dictionary, so the order of the strings and which parts fall inside or outside tags will be arbitrary.

Alex Hall · Accepted Answer

from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

print(soup.text)
# to pay
# charges
# from one's
# bank account

print(soup.text.replace('\n', ' '))
# to pay charges from one's bank account
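Since the real pages may mix newlines, spaces, and tags in any order, a slightly more defensive variant may help. The sketch below assumes the sentence lives in a single <p> element and uses get_text() with a separator and strip=True; the markup is a shortened copy of the sample above (title attributes dropped for brevity).

```python
from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# get_text() walks every descendant string, whether it sits inside a tag or not.
# strip=True trims each piece and drops the whitespace-only ones;
# the first argument is the separator placed between the remaining pieces.
sentence = p.get_text(" ", strip=True)
print(sentence)

# The same result, spelled with the stripped_strings generator.
same_sentence = " ".join(p.stripped_strings)
```

One caveat: the separator is inserted between every pair of pieces, so punctuation that directly follows a link (for example a period right after </a>) would end up with a space in front of it.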

ujhuyz0110 · Answer

Edit: After digging into the dictionary website a bit, I came up with the following solution. For each <p> tag that holds a sentence, we can do the following:

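from bs4.element import NavigableString, Tag

# p is the <p> Tag that holds one sentence
res = []

for segment in p.contents:
    if isinstance(segment, NavigableString):
        # bare text sitting directly inside <p>
        res.append(segment)
    elif isinstance(segment, Tag):
        # text wrapped in a child tag such as <a>
        res.append(segment.text)

# [:-2] drops the last two segments, presumably trailing elements on the live page;
# on the sample snippet above it would cut off "bank account", so use ''.join(res) there
final_sentence = ''.join(res[:-2])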

Hope it helps
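For reference, the same walk over p.contents can be packaged as a small helper and tried on the sample markup. This is only a sketch: the name sentence_from is made up for illustration, it keeps every segment rather than slicing any off, and the trailing period in the test markup is an added assumption to show that whitespace is collapsed without inserting extra spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def sentence_from(p):
    """Join the text of every direct child of p, tagged or bare (hypothetical helper)."""
    parts = []
    for node in p.contents:
        if isinstance(node, NavigableString):
            # bare text sitting directly inside <p>
            parts.append(str(node))
        elif isinstance(node, Tag):
            # text wrapped in a child tag such as <a>
            parts.append(node.get_text())
    # split() + join collapses newlines and runs of spaces into single spaces
    return " ".join("".join(parts).split())


html = """<p class="qotCJE">
<a href="#" class="crosslink">to pay</a>
<a href="#" class="crosslink">charges</a>
from one's
<a href="#" class="crosslink">bank account</a>.
</p>"""

p = BeautifulSoup(html, "html.parser").p
result = sentence_from(p)
print(result)
```

Because the pieces are concatenated first and whitespace is normalized afterwards, text that touches a tag (like the period here) stays attached to it.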

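If you just want to extract the text from the title attributes, you could do:

from bs4 import BeautifulSoup

# assuming text is the html text given above
soup = BeautifulSoup(text, 'html5lib')
a_tags = soup.select('a')
# each title ends with "の意味" ("meaning of"), so strip that suffix
a_strs = [a['title'].replace('の意味', '') for a in a_tags]
# unpack the list so that each title fills one placeholder
final_sentence = "{} {} from one's {}".format(*a_strs)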

Tags: python, html, python-3.x, beautifulsoup

BluNova897
