
How to use BeautifulSoup to get text from parent and children tags to put into a DOCX table

I am trying to use BeautifulSoup to parse the claims from google.com/patents and put them into a DOCX table.

I have managed to retrieve the claims, but the parent div tag holds only the first part of each claim, and its child divs hold the remaining parts, as the screenshot below shows.

[Screenshot: HTML code]

When I run the program, the first cell in the table contains the text of the parent and of all its child divs, and the child divs then populate the following table cells a second time.

I would like to populate the first cell in the DOCX table with the text from the parent div only, excluding the child divs, and the following cells with the text from the child divs.

I have tried calling .decompose() on the claim to isolate the parent text, and I have tried to figure out how to rename the children so they can be put into the table separately.

from bs4 import BeautifulSoup
import requests
from docx import Document
from docx.enum.table import WD_TABLE_DIRECTION

document = Document()

url = 'https://patents.google.com/patent/US7054130?oq=US7654309'

response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

claims = soup.select('div .claim-text')

table = document.add_table(rows=1, cols=2, style='Table Grid')

for claim in claims:

    if not claim.find('claim-ref'):

        try:
            print(claim.text + '\n')
            cells = table.add_row().cells
            cells[0].text = claim.text

            # Add space between paragraphs
            document.add_paragraph('')

        except:

            continue

document.save('my_test.docx')

I want the text at the beginning of each claim, which sits in the parent div, to go into the first cell of the DOCX table without the children's text. Each child div should then go into its own cell.

[Screenshot: what I get when I run the program]

[Screenshot: what I want to achieve]

I haven't been able to figure out how to separate the text from the parent and the children.

BubbaJones asked Jan 24 '26 22:01


2 Answers

To avoid duplicates, take the whole text from the top-level claim-text div once and split it into lines, for example:

from bs4 import BeautifulSoup
import requests
from docx import Document

document = Document()
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
# Container whose direct children are the top-level claim divs
claims_section = soup.find('section', itemprop='claims').div.div
table = document.add_table(rows=0, cols=2, style='Table Grid')

# recursive=False visits only the direct children of the container
for div in claims_section.find_all('div', class_='claim', recursive=False):
    # Outermost claim-text div; its text includes the text of all nested divs
    div_claim_text = div.find('div', class_='claim-text')
    # One entry per non-empty line of the combined text
    lines = [line.strip() for line in div_claim_text.text.splitlines() if line.strip()]

    for line in lines:
        cells = table.add_row().cells
        cells[0].text = line

document.save('my_test.docx')

This approach only stores the independent claims, because recursive=False restricts the loop to the direct children of the claims container.
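The line-splitting step can be tried offline on a small hand-written fragment. This is only a sketch: the HTML below is an assumption that mimics the nested claim-text divs described in the question, not a copy of the real page, and claim_lines is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed markup: one outer claim-text div with the preamble, plus nested
# claim-text divs for the remaining parts (not copied from the real page).
SAMPLE_HTML = """<div class="claim-text">1. A chuck comprising:
  <div class="claim-text">a base;</div>
  <div class="claim-text">a clamp.</div>
</div>"""


def claim_lines(claim_text_div):
    """Return the non-empty, stripped lines of the div's combined text."""
    combined = claim_text_div.get_text()
    return [line.strip() for line in combined.splitlines() if line.strip()]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# find() returns the first match, which is the outermost claim-text div.
lines = claim_lines(soup.find("div", class_="claim-text"))
print(lines)
```

Reading the text once from the outer div is what avoids the duplicates. The split relies on each nested div starting on its own line in the markup; if all the divs sat on a single line, splitlines() would return one combined string.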

Martin Evans answered Jan 27 '26 11:01


You can get the text from the parent div, then the texts from the child divs, and append both to a new list created for this purpose.

//div/text()[1] selects the first text node of the div.

[e for e in _list if e] removes empty elements from the list.
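Both expressions can be checked offline before hitting the real page. The following is a minimal sketch: the fragment is an assumed stand-in for the claim markup (only the CLM-00001 id and the claim-text class come from this answer), and the claim wording is invented.

```python
from lxml import html

# Assumed markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<div id="CLM-00001" class="claim">
  <div class="claim-text">1. A chuck comprising:
    <div class="claim-text">a base;</div>
    <div class="claim-text">a clamp.</div>
  </div>
</div>
"""

doc = html.fromstring(SAMPLE_HTML)

# text()[1] keeps only the first text node of the outer claim-text div,
# i.e. the text that comes before the first nested div.
parent_raw = doc.xpath(
    "//div[@id='CLM-00001']/div[@class='claim-text']/text()[1]"
)
# text() on the nested divs returns the text of each child part.
children_raw = doc.xpath(
    "//div[@id='CLM-00001']/div[@class='claim-text']"
    "/div[@class='claim-text']/text()"
)

# Strip whitespace and drop entries that are empty after stripping.
parent_claim = [e.strip() for e in parent_raw if e.strip()]
children_claims = [e.strip() for e in children_raw if e.strip()]

print(parent_claim)
print(children_claims)
```

Without the [1] predicate, the first query would also return the whitespace-only text nodes that sit between the nested divs, which is why the filtering step is still useful.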

Try this:

from lxml import html
import requests
from docx import Document

document = Document()

url = 'https://patents.google.com/patent/US7054130?oq=US7654309'

response = requests.get(url)
data = response.text
doc = html.fromstring(data)

# First text node of the outer claim-text div: the beginning of claim 1
parent_claim = [e.strip() for e in doc.xpath("//div[@id='CLM-00001']/div[@class='claim-text']/text()[1]") if e.strip()]
# Text of each nested claim-text div: the remaining parts of claim 1
children_claims = [e.strip() for e in doc.xpath("//div[@id='CLM-00001']/div[@class='claim-text']/div[@class='claim-text']/text()") if e.strip()]
table = document.add_table(rows=1, cols=2, style='Table Grid')
# Parent text first, then the children, in document order
claims = parent_claim + children_claims

for claim in claims:
    print(claim + '\n')
    # One new table row per claim part, text in the first column
    cells = table.add_row().cells
    cells[0].text = claim

    # Appends an empty paragraph to the end of the document body
    document.add_paragraph('')

document.save('my_test.docx')
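Since the question asked for BeautifulSoup, the same parent/children split can also be sketched without XPath. This is an alternative technique, not the code above: it uses find_all with recursive=False to look only at the direct children of the outer div. The fragment is assumed markup and split_claim is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<div id="CLM-00001" class="claim">
  <div class="claim-text">1. A chuck comprising:
    <div class="claim-text">a base;</div>
    <div class="claim-text">a clamp.</div>
  </div>
</div>
"""


def split_claim(claim_text_div):
    """Return (own text, list of child texts) for one outer claim-text div."""
    # string=True with recursive=False yields only the text nodes that are
    # direct children of the div, so nested div text is excluded.
    own_parts = claim_text_div.find_all(string=True, recursive=False)
    own_text = " ".join(s.strip() for s in own_parts if s.strip())
    # Direct child divs only; each one becomes its own entry.
    child_divs = claim_text_div.find_all(
        "div", class_="claim-text", recursive=False
    )
    child_texts = [child.get_text(" ", strip=True) for child in child_divs]
    return own_text, child_texts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("div", id="CLM-00001").find("div", class_="claim-text")
parent_text, child_texts = split_claim(outer)
print(parent_text)
print(child_texts)
```

Each returned string can then be written to its own row with table.add_row().cells, exactly as in the loop above.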


sashaboulouds answered Jan 27 '26 12:01


