Parsing HTML with Python

Question

I'm using BeautifulSoup to extract some data from a search result from this website http://www.cpso.on.ca/docsearch/default.aspx

Here's a sample of the HTML after running it through .prettify():

<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&amp;id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2
  <br />
  Phone:&nbsp;(555) 555-5555
  <br />
  Fax:&nbsp;(555) 555-555
 </td>
 <td align="center">
 </td>
</tr>

Essentially every <tr> block has 3 <td> blocks.

I want the output to be

Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2

I also have to separate entries by a new line.

I'm having a problem writing the address, which is stored in the 2nd <td> block.

I'm also writing this to a file.

Here's what I have so far... it doesn't work :p

for tr in soup.findAll('tr'):
    #td1 = tr.td
    td2 = tr.td.nextSibling.nextSibling 

    for a in tr.findAll('a'):
        target.write(a.string)
        target.write(" ")

    for i in range(len(td2.contents)):
        if i != None:
            target.write(td2.contents[i].string)
            target.write(" ")
    target.write("\n")

ekhumoro · Accepted Answer

This should do most of what you want:

import os
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# html is assumed to hold the source of the search results page
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

with open('output.txt', 'wb') as stream:
    for tr in soup.findAll('tr')[1:]: # [1:] skips the header
        columns = tr.findAll('td')
        line = [columns[0].a.string.strip()]
        for item in (item.strip() for item in columns[1].findAll(text=True)):
            if (item and not item.startswith('Phone:')
                and not item.startswith('Fax:')):
                line.append(item)
        stream.write(' '.join(line).encode('utf-8'))
        stream.write(os.linesep)

UPDATE

Added some code to show how to write the names and addresses to file.

Also changed the output so that names and addresses are written on one line, and phone and fax numbers are omitted.
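
For anyone on Python 3, here is a rough sketch of the same idea using bs4 (the beautifulsoup4 package) instead of BeautifulSoup 3. It is not part of the original answer. The names SAMPLE_HTML, extract_lines and write_lines are made up for illustration, and the header row in the sample is an assumption about the real page. Rather than slicing off the first row, this version skips any row that lacks a name link.

```python
from bs4 import BeautifulSoup

# Sample row copied from the question. The <table> wrapper and the
# header row are assumptions about what the real page looks like.
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>Address</th><th></th></tr>
<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&amp;id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2
  <br />
  Phone:&nbsp;(555) 555-5555
  <br />
  Fax:&nbsp;(555) 555-555
 </td>
 <td align="center">
 </td>
</tr>
</table>
"""


def extract_lines(html):
    """Return one 'name address' string per data row."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tr in soup.find_all("tr"):
        columns = tr.find_all("td")
        if len(columns) < 2 or columns[0].a is None:
            continue  # header row, or a row without a name link
        parts = [columns[0].a.get_text()]
        # stripped_strings yields each text node with whitespace trimmed
        for text in columns[1].stripped_strings:
            if not text.startswith(("Phone:", "Fax:")):
                parts.append(text)
        # split() then join collapses whitespace runs, including the
        # non-breaking spaces that &nbsp; decodes to
        lines.append(" ".join(" ".join(parts).split()))
    return lines


def write_lines(lines, path):
    """Write each entry on its own line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as stream:
        for line in lines:
            stream.write(line + "\n")


if __name__ == "__main__":
    for entry in extract_lines(SAMPLE_HTML):
        print(entry)
```

Opening the file in text mode with an explicit encoding replaces the manual .encode('utf-8') and os.linesep calls, since Python 3 handles both when writing text.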

Tags: python, html, beautifulsoup

KylePDM
