Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fixing Broken HTML in Python - Beautifulsoup not working

I'm interested in scraping text from this table: https://ows.doleta.gov/unemploy/trigger/2011/trig_100211.html as well as others like it.

I wrote a quick python script that works for other tables formatted in a similar way:

    state = ""
    weeks = ""
    edate = "" 
    pdate = url[-11:]
    pdate = pdate[:-5]

    table = soup.find("table") 

    for row in table.findAll('tr'):     
        cells = row.findAll("td")
        if len(cells) == 13: 
            state = row.find("th").find(text=True) 
            weeks = cells[11].find(text=True) 
            edate = cells[12].find(text=True)
            try:   
                print pdate, state, weeks, edate 
                f.writerow([pdate, state, weeks, edate])
            except:  
                print state[1] + " error"  

But, the script doesn't work for this table, because the tags are broken for half of the rows. Half of the rows are formatted without tags to signal the beginning of a row:

</tr> #end of last row, on State0  
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
</tr> #theoretically, end of row about State1 

Because half of the rows aren't properly formatted, BeautifulSoup disregards them. I've tried fixing the problem with tidy, but BeautifulSoup has problems reading the code it suggests. I've thought about fixing the problem by generating a new string with tags in the right places, but I'm not sure how to do that.

Any suggestions?

like image 222
Kay Avatar asked Mar 21 '26 07:03

Kay


1 Answers

Since different parsers are free to handle broken HTML as they see fit, it's often useful in these cases to explore how they do so before trying to fix it yourself.

In this case you may be interested in how html5lib handles this - it looks to me like it inserts the missing <tr> elements instead of discarding all of the orphaned <td> elements like lxml (the default) does.

soup = BeautifulSoup(text) #default parser - lxml

soup.table.find_all('tr')[9]
Out[31]: 
<tr bgcolor="#C0C0C0">
<td align="center" headers="Arizona noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Arizona noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Arizona noinfo" width="25"><font size="-2"> </font></td>
<th align="left" id="Arizona " width="100"><font size="-2">Arizona </font></th>
<td align="center" headers="Arizona noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Arizona noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Arizona 13_week_IUR indicators" width="50"><font size="-2">3.03</font></td>
<td align="center" headers="Arizona pct_of_prior_2years indicators" width="50"><font size="-2">79</font></td>
<td align="center" headers="Arizona 3_mo_satur indicators" width="50"><font size="-2">9.3</font></td>
<td align="center" headers="Arizona year pct_of_prior indicators" width="50"><font size="-2">94</font></td>
<td align="center" headers="Arizona 2nd_year pct_of_prior indicators" width="50"><font size="-2">93</font></td>
<td align="center" headers="Arizona 2nd_year pct_of_prior indicators" width="50"><font size="-2">155</font></td>
<td align="center" headers="Arizona avail_wks pct_of_prior indicators noinfo" width="50"><font size="-2"> </font></td>
<td align="center" headers="Arizona dates periods status" width="100"><font size="-2">E 06-11-2011</font></td>
</tr>

soup = BeautifulSoup(text, 'html5lib')

soup.table.find_all('tr')[9] #same path, different result!
Out[33]: 
<tr><td align="center" headers="Alaska noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Alaska noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Alaska noinfo" width="25"><font size="-2"> </font></td>
<th align="left" id="Alaska " width="100"><font size="-2">Alaska </font></th>
<td align="center" headers="Alaska noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Alaska noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Alaska 13_week_IUR indicators" width="50"><font size="-2">3.82</font></td>
<td align="center" headers="Alaska pct_of_prior_2years indicators" width="50"><font size="-2">90</font></td>
<td align="center" headers="Alaska 3_mo_satur indicators" width="50"><font size="-2">7.6</font></td>
<td align="center" headers="Alaska year pct_of_prior indicators" width="50"><font size="-2">96</font></td>
<td align="center" headers="Alaska 2nd_year pct_of_prior indicators" width="50"><font size="-2">95</font></td>
<td align="center" headers="Alaska 2nd_year pct_of_prior indicators" width="50"><font size="-2">117</font></td>
<td align="center" headers="Alaska avail_wks pct_of_prior indicators noinfo" width="50"><font size="-2"> </font></td>
<td align="center" headers="Alaska dates periods status" width="100"><font size="-2">E 06-11-2011</font></td>
</tr>

More at bs4 docs: Differences Between Parsers. Since this table appears okay when rendered in a browser, and html5lib attempts to parse pages the same way a browser does, it's probably a safe bet that that's what you want.

like image 73
roippi Avatar answered Mar 23 '26 08:03

roippi



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!