<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>
I am trying to reconstruct the sentence "to pay charges from one's bank account", which is split across the HTML above. My problem is that one part of the sentence is not wrapped in a tag of its own. When I use
BeautifulSoup.find_all()
I only get the text inside the link tags, and when I use
BeautifulSoup.contents
I only get "from one's", but not the text inside the link tags.
Is there a way to walk through this markup and reconstruct the whole sentence?
Edit: The code above is just an example. I am scraping a dictionary, so the order of the strings, and which parts fall inside or outside tags, is arbitrary.
from bs4 import BeautifulSoup
html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
print(soup.text)
# to pay
# charges
# from one's
# bank account
print(soup.text.strip().replace('\n', ' '))  # strip() removes the leading and trailing newline first
# to pay charges from one's bank account
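As an alternative to replacing newlines by hand, here is a minimal sketch that uses `get_text()` with a separator and `strip=True`. It assumes the sample HTML from the question and the built-in `'html.parser'` backend; the variable name `sentence` is only illustrative.

```python
from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims every text fragment and skips the empty ones;
# the first argument is the separator placed between the fragments.
sentence = soup.p.get_text(" ", strip=True)
print(sentence)
# to pay charges from one's bank account
```

This works regardless of which fragments are inside or outside tags. One caveat: the separator is inserted between every pair of fragments, so a single word that is split across tags would come out with a space in the middle.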
Edit:
After digging into the dictionary website a bit, I came up with the following solution. For each <p> tag that holds a sentence, we can do the following:
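from bs4.element import NavigableString, Tag

# p is the <p> Tag that holds one sentence, e.g. soup.find('p')
res = []
for segment in p.contents:
    if isinstance(segment, NavigableString):
        # bare text sitting directly inside <p>
        res.append(segment)
    elif isinstance(segment, Tag):
        # text wrapped in a child tag such as <a>
        res.append(segment.text)
# [:-2] drops the last two segments (presumably trailing non-sentence nodes on the live page);
# with the sample HTML above it would cut off "bank account", so use ''.join(res) there
final_sentence = ''.join(res[:-2])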
Hope it helps
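The loop over `p.contents` can be wrapped in a small helper and applied to every sentence paragraph. The sketch below is only an illustration: the function name `sentence_from_paragraph` is made up, the second `<p>` is a hypothetical extra sentence, and it assumes every sentence paragraph carries the class `qotCJE` as in the sample. It joins all segments without slicing and then collapses whitespace.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def sentence_from_paragraph(p):
    """Join the text of every direct child of p, in document order."""
    parts = []
    for segment in p.contents:
        if isinstance(segment, NavigableString):
            # bare text that sits directly inside <p>
            parts.append(str(segment))
        elif isinstance(segment, Tag):
            # text wrapped in a child tag such as <a>
            parts.append(segment.get_text())
    # collapse newlines and repeated spaces into single spaces
    return " ".join("".join(parts).split())


# First paragraph: the sample from the question. Second: a hypothetical one.
html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>
<p class="qotCJE">pay <a href="#">in cash</a> only</p>"""

soup = BeautifulSoup(html, "html.parser")
sentences = [sentence_from_paragraph(p) for p in soup.find_all("p", class_="qotCJE")]
print(sentences)
# ["to pay charges from one's bank account", 'pay in cash only']
```

If the live page appends extra nodes after the sentence, they would need to be filtered out inside the helper; which nodes those are depends on the actual page markup.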
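If you just want to extract the text from the title attributes, you could do:
from bs4 import BeautifulSoup
# assuming text is the HTML given above; 'html5lib' requires the html5lib package
soup = BeautifulSoup(text, 'html5lib')
a_tags = soup.select('a')
# each title looks like "to payの意味", so drop the trailing "の意味" ("meaning of")
a_strs = (a['title'].replace('の意味', '') for a in a_tags)
# unpack the generator so that each title fills one placeholder
final_sentence = "{} {} from one's {}".format(*a_strs)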