I am trying to use BeautifulSoup to parse the claims from google.com/patents and put them into a DOCX table.
I have managed to retrieve the claims, but the parent div tag holds the first part of each claim, and its child divs hold the rest of the claim.

When I run the program, the first cell in the table contains the text of the parent and all of its child divs, and the child divs then populate the following cells again.
I would like to populate the first cell in the DOCX table with the text from the parent div only, excluding the child divs, and the following cells with the text from each child div.
I have tried calling .decompose() on the claim to isolate the parent, and I have tried to figure out how to rename the children so that I can put them into the table.
from bs4 import BeautifulSoup
import requests
from docx import Document
from docx.enum.table import WD_TABLE_DIRECTION
document = Document()
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.select('div .claim-text')
table = document.add_table(rows=1, cols=2, style='Table Grid')
for claim in claims:
    if not claim.find('claim-ref'):
        try:
            print(claim.text + '\n')
            cells = table.add_row().cells
            cells[0].text = claim.text
            # Add space between paragraphs
            document.add_paragraph('')
        except:
            continue
document.save('my_test.docx')
I want to put the opening text of each claim, found in the parent div, into the first cell of a DOCX table, without the children's text. Each child should go into its own cell.
This is what I get when I try to run the program:

This is what I want to achieve:

I haven't been able to figure out how to separate the text from the parent and the children.
To avoid duplicates, take the whole text from the top-level claim-text div and split it into lines, for example:
from bs4 import BeautifulSoup
import requests
from docx import Document
document = Document()
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims_section = soup.find('section', itemprop='claims').div.div
table = document.add_table(rows=0, cols=2, style='Table Grid')
for div in claims_section.find_all('div', class_='claim', recursive=False):
    div_claim_text = div.find_next('div', class_='claim-text')
    lines = [line.strip() for line in div_claim_text.text.splitlines() if line.strip()]
    for line in lines:
        cells = table.add_row().cells
        cells[0].text = line
document.save('my_test.docx')
This approach only stores the independent claims.
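The line-splitting step can be tried without a network request. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup that imitates the nested claim-text divs described in the question, and split_claim is a helper name introduced here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified stand-in for the nested claim-text divs
# described in the question, not a copy of the real page.
SAMPLE_HTML = """
<div class="claim">
  <div class="claim-text">1. An apparatus comprising:
    <div class="claim-text">a first part; and</div>
    <div class="claim-text">a second part.</div>
  </div>
</div>
"""


def split_claim(claim_text_div):
    """Return the non-empty, stripped lines of the div's full text."""
    return [line.strip() for line in claim_text_div.text.splitlines() if line.strip()]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("div", class_="claim-text")  # first match is the outermost div
parts = split_claim(outer)

for part in parts:
    print(part)
```

Note that this split depends on the markup containing a line break between the nested divs. If the HTML arrived on a single line, the whole claim would end up in one entry.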
You can get the text from the parent div, then the text from the child divs, and append both to a new list created for this purpose.
//div/text()[1] selects the first text node of the div.
[e for e in _list if e] removes empty elements.
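As a rough check of these two idioms, here is a small sketch that runs them against an inline document instead of the live page. SAMPLE is hypothetical markup that assumes the same id and class names as the XPath expressions in this answer.

```python
from lxml import html

# Hypothetical markup, assumed to mirror the structure targeted by the XPath below.
SAMPLE = """
<div id="CLM-00001" class="claim">
  <div class="claim-text">1. An apparatus comprising:
    <div class="claim-text">a first part; and</div>
    <div class="claim-text">a second part.</div>
  </div>
</div>
"""

doc = html.fromstring(SAMPLE)
outer = "//div[@id='CLM-00001']/div[@class='claim-text']"

# text()[1] picks only the first text node of the outer div,
# which is the text that precedes the first nested div.
raw_parent = doc.xpath(outer + "/text()[1]")

# text() on the nested divs returns their own text nodes.
raw_children = doc.xpath(outer + "/div[@class='claim-text']/text()")

# Strip whitespace and drop entries that are empty after stripping.
parent = [e.strip() for e in raw_parent if e.strip()]
children = [e.strip() for e in raw_children if e.strip()]

print(parent)
print(children)
```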
Try this:
from lxml import html
import requests
from docx import Document
from docx.enum.table import WD_TABLE_DIRECTION
document = Document()
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
doc = html.fromstring(data)
parent_claim = [e.strip() for e in doc.xpath("//div[@id='CLM-00001']/div[@class='claim-text']/text()[1]") if e.strip()]
children_claims = [e.strip() for e in doc.xpath("//div[@id='CLM-00001']/div[@class='claim-text']/div[@class='claim-text']/text()") if e.strip()]
table = document.add_table(rows=1, cols=2, style='Table Grid')
claims = []
for e in parent_claim:
    claims.append(e)
for e in children_claims:
    claims.append(e)
for claim in claims:
    print(claim + '\n')
    cells = table.add_row().cells
    cells[0].text = claim
    # Add space between paragraphs
    document.add_paragraph('')
document.save('my_test.docx')