I want to extract both the header and the paragraph text from a webpage. The issue is that there is a variable number of headers, each followed by a paragraph, and they all use the same header tag and the same paragraph tag.
Sample HTML -
<h6>PHYSICAL DESCRIPTION</h6>
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
<h6>SCARS, MARKS, TATTOOS</h6>
<p>     
       
</p>
I am using the code below -
sub = soup.findAll('h6')
    print sub.text
sub = soup.findAll('p')
for strong_tag in sub.find_all('strong'):
    print strong_tag.text, strong_tag.next_sibling
Since the headers don't contain the p tags, I am not sure how to process the HTML in the order in which it is written.
Is there a way that we can treat the HTML like a file and find the next h6 tags and then the next p tag and keep doing it till the end?
You can use Tag.find_next_sibling() here:
for header in soup.find_all('h6'):
    para = header.find_next_sibling('p')
The .find_next_sibling() call returns the first p tag that is a next sibling of the header tag.
Demo:
>>> for header in soup.find_all('h6'):
...     print header.text
...     para = header.find_next_sibling('p')
...     for strong_tag in para.find_all('strong'):
...         print strong_tag.text, strong_tag.next_sibling
...     print
... 
PHYSICAL DESCRIPTION
YOB:  1987
RACE:  WHITE
GENDER:  FEMALE
HEIGHT:  5'05''
WEIGHT:  118
EYE COLOR:  GREEN
HAIR COLOR:  BROWN
SCARS, MARKS, TATTOOS
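As a rough Python 3 sketch of the same idea, the loop below collects each header's fields into a dictionary instead of printing them. The `sections` and `fields` names, the shortened sample HTML, and the choice of the built-in `html.parser` are assumptions made for illustration; they are not part of the original question.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample HTML from the question
html = """
<h6>PHYSICAL DESCRIPTION</h6>
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
</p>
<h6>SCARS, MARKS, TATTOOS</h6>
<p>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

sections = {}  # hypothetical container: header text -> {label: value}
for header in soup.find_all("h6"):
    # First <p> that is a later sibling of this header (None if there is none)
    para = header.find_next_sibling("p")
    fields = {}
    if para is not None:
        for strong_tag in para.find_all("strong"):
            label = strong_tag.get_text(strip=True).rstrip(":")
            value = strong_tag.next_sibling
            # The value is the text node right after <strong>, if there is one
            fields[label] = value.strip() if isinstance(value, str) else ""
    sections[header.get_text(strip=True)] = fields

for title, fields in sections.items():
    print(title, fields)
```

A header whose paragraph is empty, such as SCARS, MARKS, TATTOOS, simply ends up with an empty dictionary.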
Calling find_next_sibling('p') could find the wrong <p> tag when there is no paragraph between the current header and the next one:
<h6>Foo</h6>
<div>A div, not a p</div>
<h6>Bar</h6>
<p>This <i>is</i> a paragraph</p>
In this case, search for either a <p> or a <h6> tag:
for header in soup.find_all('h6'):
    next_sibling = header.find_next_sibling(['p', 'h6'])
    if next_sibling is None or next_sibling.name == 'h6':
        # no <p> tag between this header and the next (or the end), skip
        continue
The header.find_next_sibling(['p', 'h6']) call will find either the next paragraph or the next header, whichever comes first. It returns None when neither follows the header, which is why the loop checks for None before reading .name.
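To make the skip logic concrete, here is a small self-contained Python 3 sketch run against the Foo/Bar HTML above. The `pairs` list and the use of the built-in `html.parser` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<h6>Foo</h6>
<div>A div, not a p</div>
<h6>Bar</h6>
<p>This <i>is</i> a paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []  # hypothetical result list of (header text, paragraph text)
for header in soup.find_all("h6"):
    # Whichever comes first among the following siblings: a <p> or an <h6>
    next_sibling = header.find_next_sibling(["p", "h6"])
    if next_sibling is None or next_sibling.name == "h6":
        # No paragraph belongs to this header, so skip it
        continue
    pairs.append((header.get_text(), next_sibling.get_text()))

print(pairs)  # [('Bar', 'This is a paragraph')]
```

The Foo header is skipped because the first matching sibling after it is the next h6, so only Bar is paired with a paragraph.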