
Parsing HTML with Python

I'm using BeautifulSoup to extract some data from a search results page on this website: http://www.cpso.on.ca/docsearch/default.aspx

Here's a sample of the HTML after running it through .prettify():

<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&amp;id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2
  <br />
  Phone:&nbsp;(555) 555-5555
  <br />
  Fax:&nbsp;(555) 555-555
 </td>
 <td align="center">
 </td>
</tr>

Essentially every <tr> block has 3 <td> blocks.

I want the output to be

Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2

Each entry also has to go on its own line.

I'm having a problem writing the address, which is stored in the 2nd <td> block.

I'm also writing this to a file.

Here's what I have so far... it doesn't work :p

for tr in soup.findAll('tr'):
    #td1 = tr.td
    td2 = tr.td.nextSibling.nextSibling 

    for a in tr.findAll('a'):
        target.write(a.string)
        target.write(" ")

    for i in range(len(td2.contents)):
        if i != None:
            target.write(td2.contents[i].string)
            target.write(" ")
    target.write("\n")

KylePDM asked Jan 27 '26 16:01


1 Answer

This should do most of what you want:

import os
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# `html` is assumed to hold the source of the search results page
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

with open('output.txt', 'wb') as stream:
    for tr in soup.findAll('tr')[1:]: # [1:] skips the header
        columns = tr.findAll('td')
        # first column: the doctor's name is the link text
        line = [columns[0].a.string.strip()]
        # second column: keep every non-empty text node except phone and fax
        for item in (item.strip() for item in columns[1].findAll(text=True)):
            if (item and not item.startswith('Phone:')
                and not item.startswith('Fax:')):
                line.append(item)
        stream.write(' '.join(line).encode('utf-8'))
        stream.write(os.linesep)

UPDATE

Added some code to show how to write the names and addresses to a file.

Also changed the output so that names and addresses are written on one line, and phone and fax numbers are omitted.
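
If BeautifulSoup 3 isn't an option (for example on Python 3), here's a rough sketch of the same idea using only the standard library's html.parser. Note that this swaps in a different technique from the code above. The names DoctorRowParser, extract_lines and SAMPLE are made up for illustration, and it assumes every data row has a link with class "doctor" in its first cell, as in the sample from the question.

```python
from html.parser import HTMLParser

# Sample row copied from the question (assumed representative of real rows).
SAMPLE = """
<table>
<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&amp;id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2
  <br />
  Phone:&nbsp;(555) 555-5555
  <br />
  Fax:&nbsp;(555) 555-555
 </td>
 <td align="center">
 </td>
</tr>
</table>
"""


class DoctorRowParser(HTMLParser):
    """Build one 'name address' line per <tr>, leaving out phone and fax."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; and &amp;
        super().__init__(convert_charrefs=True)
        self.rows = []         # finished output lines
        self.parts = None      # text pieces of the row being read
        self.column = 0        # 1-based index of the current <td>
        self.in_link = False   # inside <a class="doctor">?
        self.has_name = False  # did this row contain a doctor link?

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.parts, self.column, self.has_name = [], 0, False
        elif tag == 'td':
            self.column += 1
        elif tag == 'a' and dict(attrs).get('class') == 'doctor':
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False
        elif tag == 'tr' and self.parts is not None:
            # rows without a doctor link (e.g. a header row) are dropped
            if self.has_name:
                self.rows.append(' '.join(self.parts))
            self.parts = None

    def handle_data(self, data):
        if self.parts is None:
            return
        # collapse runs of whitespace, including non-breaking spaces
        text = ' '.join(data.split())
        if not text:
            return
        if self.column == 1 and self.in_link:
            self.parts.append(text)
            self.has_name = True
        elif self.column == 2 and not text.startswith(('Phone:', 'Fax:')):
            self.parts.append(text)


def extract_lines(html):
    """Return a list with one output line per doctor row in `html`."""
    parser = DoctorRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == '__main__':
    for line in extract_lines(SAMPLE):
        print(line)
```

To write to a file instead of printing, open it in text mode with an explicit encoding, e.g. open('output.txt', 'w', encoding='utf-8'), and write each line followed by '\n'.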

ekhumoro answered Jan 29 '26 05:01