I'm using BeautifulSoup to extract some data from the search results on this website: http://www.cpso.on.ca/docsearch/default.aspx
Here's a sample of the HTML after it has been run through .prettify():
<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto ON M4C 5T2
  <br />
  Phone: (555) 555-5555
  <br />
  Fax: (555) 555-555
 </td>
 <td align="center">
 </td>
</tr>
Essentially every <tr> block has 3 <td> blocks.
I want the output to be
Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2
I also have to separate entries by a new line.
I'm having trouble writing the address, which is stored in the 2nd <td> block.
I'm also writing this to a file.
Here's what I have so far... it doesn't work :p
for tr in soup.findAll('tr'):
    #td1 = tr.td
    td2 = tr.td.nextSibling.nextSibling
    for a in tr.findAll('a'):
        target.write(a.string)
    target.write(" ")
    for i in range(len(td2.contents)):
        if i != None:
            target.write(td2.contents[i].string)
            target.write(" ")
    target.write("\n")
This should do most of what you want:
import os
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

with open('output.txt', 'wb') as stream:
    for tr in soup.findAll('tr')[1:]: # [1:] skips the header
        columns = tr.findAll('td')
        line = [columns[0].a.string.strip()]
        for item in (item.strip() for item in columns[1].findAll(text=True)):
            if (item and not item.startswith('Phone:')
                and not item.startswith('Fax:')):
                line.append(item)
        stream.write(' '.join(line).encode('utf-8'))
        stream.write(os.linesep)
UPDATE
Added some code to show how to write the names and addresses to a file.
Also changed the output so that names and addresses are written on one line, and phone and fax numbers are omitted.
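The code above targets BeautifulSoup 3 on Python 2 (note the `from BeautifulSoup import ...` import and `findAll`). For anyone on Python 3 with the `bs4` package, the same idea might look roughly like the sketch below. It is only an illustration: the `SAMPLE_HTML` string, its header row, and the helper names `extract_lines` and `write_lines` are made up for this example and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the row in the question.
# The <th> header row is an assumption; the real page may differ.
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>Address</th><th>Notes</th></tr>
<tr>
<td><a class="doctor" href="details.aspx?view=1&amp;id=72374">Smith, Jane</a> (#72374)</td>
<td>Suite 042<br />21 Jump St<br />Toronto ON M4C 5T2<br />Phone: (555) 555-5555<br />Fax: (555) 555-555</td>
<td align="center"></td>
</tr>
</table>
"""


def extract_lines(html):
    """Return one 'name address' string per data row."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tr in soup.find_all("tr"):
        columns = tr.find_all("td")
        # Skip rows without at least two <td> cells or without a name link
        # (this also skips a <th>-only header row).
        if len(columns) < 2 or columns[0].a is None:
            continue
        parts = [columns[0].a.get_text(strip=True)]
        # stripped_strings yields each non-empty text node with surrounding
        # whitespace removed, so the <br /> tags contribute nothing.
        for text in columns[1].stripped_strings:
            if not text.startswith(("Phone:", "Fax:")):
                parts.append(text)
        lines.append(" ".join(parts))
    return lines


def write_lines(lines, path):
    """Write one entry per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as stream:
        for line in lines:
            stream.write(line + "\n")


if __name__ == "__main__":
    for line in extract_lines(SAMPLE_HTML):
        print(line)
```

Calling `write_lines(extract_lines(html), "output.txt")` would then write the entries to a file. Iterating over text nodes rather than over `td.contents` matters here: a `<br />` tag has no text, so its `.string` is `None`, and passing `None` to `write()` raises a `TypeError`.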