
Extract text from HTML Tags and plain text (not wrapped in tags)

<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a> 
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a> 
from one's 
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>

I am trying to reconstruct the sentence "to pay charges from one's bank account" that's split into the above HTML code. My problem is that one part of the sentence is not wrapped inside HTML tags. When I try to use:

BeautifulSoup.find_all()

I only get the text between the link tags and when I try to use

BeautifulSoup.contents

I only get "from one's" but not the text in between the link tags.

Is there a way to go through this code and reconstruct the sentence?

Edit: The above code is just an example, I am trying to scrape a dictionary so the order of the strings and which parts will be inside/outside tags will be arbitrary.

BluNova897 asked Nov 23 '25 06:11


2 Answers

from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

print(soup.text)
# to pay
# charges
# from one's
# bank account

# strip() removes the space left by the leading and trailing newlines
print(soup.text.replace('\n', ' ').strip())
# to pay charges from one's bank account
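
As a variation on the above, here is a minimal sketch that lets `get_text()` do the joining and trimming in one call. It assumes the sentence lives in the first `<p>` of the parsed markup and that every piece should be separated by a single space; both are assumptions about the page, not something the question confirms.

```python
from bs4 import BeautifulSoup

# Same example markup as in the question.
html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() visits every text node in document order, tagged or not.
# separator=" " joins the pieces with a space; strip=True trims each piece
# and skips pieces that are whitespace only.
sentence = soup.p.get_text(separator=" ", strip=True)
print(sentence)
# to pay charges from one's bank account
```

One caveat: because a space is inserted between every pair of text nodes, a word that is split across a tag boundary (for example a link followed directly by a plural "s") would come out with a space in the middle.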
Alex Hall answered Nov 25 '25 19:11


Edit: After digging into the dictionary website a bit, I came up with the following solution. For the <p> tag of each sentence, we could do the following:

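from bs4.element import NavigableString, Tag

# p is the <p> Tag that holds one sentence, e.g. p = soup.find('p')
res = []

for segment in p.contents:
    if isinstance(segment, NavigableString):
        # plain text that is not wrapped in any tag
        res.append(str(segment))
    elif isinstance(segment, Tag):
        # text inside a tag such as <a>
        res.append(segment.text)

# [:-2] drops the last two segments, presumably trailing nodes on the live page;
# for the example <p> in the question, use ''.join(res) instead
final_sentence = ''.join(res[:-2])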
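
To show that the loop does not care which parts are tagged, here is a rough self-contained sketch of the same traversal wrapped in a helper. The name `sentence_from` and the sample markup are made up for illustration (they are not from the dictionary site), and the helper keeps every segment instead of slicing off the last two.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def sentence_from(p):
    """Join the text of every direct child of p, in document order."""
    parts = []
    for segment in p.contents:
        if isinstance(segment, Tag):
            # Text inside a tag such as <a>, including nested tags.
            parts.append(segment.get_text())
        elif isinstance(segment, NavigableString):
            # Plain text that sits directly inside <p>.
            parts.append(str(segment))
    # Collapse newlines and repeated spaces left over from the markup.
    return " ".join("".join(parts).split())


# Hypothetical markup: here the untagged text comes first.
html = '<p>in <a href="#">cash</a> or by\n<a href="#">bank transfer</a></p>'
p = BeautifulSoup(html, "html.parser").p
print(sentence_from(p))
# in cash or by bank transfer
```

Joining with an empty string first and normalising whitespace afterwards keeps the spacing that is already in the markup, so text that touches a tag without a space stays glued together.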

Hope it helps


If you just want to extract the text from the title attributes (this hard-codes the untagged part, so it only fits this exact example), you could do

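from bs4 import BeautifulSoup

# assuming text is the html text given above
soup = BeautifulSoup(text, 'html5lib')  # needs the html5lib package installed
a_tags = soup.select('a')
# each title looks like "to payの意味" ("meaning of to pay"), so drop the suffix
a_strs = [a['title'].replace('の意味', '') for a in a_tags]
# unpack the list so that each placeholder gets its own argument
final_sentence = "{} {} from one's {}".format(*a_strs)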
ujhuyz0110 answered Nov 25 '25 20:11


