I have an HTML page with about 50 tables on it. Each table has the same layout but different values, e.g.:
<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value1</td>
<td width="25" height="25" class="usernum">value2</td>
<td width="25" height="25" class="usernum">value3</td>
<td width="25" height="25" class="usernum">value4</td>
<td width="25" height="25" class="usernum">value5</td>
<td width="25" height="25" class="usernum">value6</td>
<td width="25" height="25" class="totalnum">otherVal</td>
</tr>
</table>
My REST server is running Django/Python, so in my urls.py I am calling my parse_url() function, which is where I want to do all the work. My problem is that I'm pretty much a newbie when it comes to Python, so I just don't know where to put my code. I got some code from the HTMLParser Python docs and changed it as follows:
import urllib, urllib2
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        HttpResponse("Encountered data %s" % data)

def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)
    return HttpResponse('DONE')
This code, at the moment, doesn't output anything useful. It just prints out DONE.
How do I use the class methods such as handle_starttag()? Shouldn't these be called automatically when I use p.feed(content)?
Basically, what I'm trying to accomplish in the end is, when I go to mysite.com/showlist, to be able to output a list saying:
value1
value2
value3
value4
value5
value6
othervalue
This needs to be done in a loop, because there are roughly 50 tables with different values in each table.
Thanks for helping a beginner!!
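The handler methods are called automatically by p.feed(content); the problem is where their output goes. print writes to the stdout of the server process (the console running Django), not to the browser, and the HttpResponse created inside handle_data() is discarded because nothing returns it. Only the response returned by the view reaches the client, so collect the data on the parser and return it from parse_url(). Here is how to get HTMLParser to do your bidding: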
import urllib
from django.http import HttpResponse
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.capture_data = False  # True only while inside a <td>
        self.data_list = []        # text collected from all <td> cells
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.capture_data = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.capture_data = False

    def handle_data(self, data):
        # Skip text outside cells and whitespace-only chunks.
        if self.capture_data and data and not data.isspace():
            self.data_list.append(data)

def parse_url(request):
    p = MyHTMLParser()
    url = 'http://www.mysite.com/lists.asp'
    content = urllib.urlopen(url).read()
    p.feed(content)  # feed() calls the handle_* methods as it parses
    # Only what the view returns is sent to the browser.
    return HttpResponse(str(p.data_list))
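The view above returns the Python repr of one flat list. To get closer to the one-value-per-line output the question asks for, and to keep each table's values together, the same idea can be extended to track table boundaries. What follows is only a rough sketch, not the code from the answer: TableCellParser, rows_as_text and SAMPLE_HTML are names made up for illustration, the sample markup is a shortened copy of the table in the question, and the import fallback is there only so the snippet runs on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread


class TableCellParser(HTMLParser):
    """Collect the text of every <td>, grouped by enclosing <table>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tables = []      # one list of cell strings per <table>
        self.in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'td' and self.tables:
            self.in_cell = True
            self.tables[-1].append('')  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            # Text may arrive in several chunks, so append to the current cell.
            self.tables[-1][-1] += data


def rows_as_text(html):
    """Return all non-empty cell values, one per line, table after table."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    lines = []
    for cells in parser.tables:  # the loop over the (roughly 50) tables
        lines.extend(cell.strip() for cell in cells if cell.strip())
    return '\n'.join(lines)


# Hypothetical sample input: two shortened tables in the question's layout.
SAMPLE_HTML = """
<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value1</td>
<td width="25" height="25" class="usernum">value2</td>
<td width="25" height="25" class="totalnum">otherVal</td>
</tr>
</table>
<table align="right" class="customTableClass">
<tr align="center">
<td width="25" height="25" class="usernum">value7</td>
<td width="25" height="25" class="totalnum">otherVal2</td>
</tr>
</table>
"""

print(rows_as_text(SAMPLE_HTML))
```

In the view, the string from rows_as_text(content) could be passed to HttpResponse in place of str(p.data_list). A browser collapses newlines in HTML, so a plain-text content type (or joining with a line-break tag instead) would probably be needed to see one value per line.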
I would recommend putting the class into a utils.py file and keeping it in the same folder as your views.py, then importing it from there. This keeps views.py manageable, since it then contains only views.
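One way that split might look is sketched below. It is an assumption about the layout rather than anything prescribed by Django: extract_cell_text is a hypothetical helper name, the view is shown only in comments because it needs Django and network access, and the import fallback is there only so the snippet also runs on Python 3. Having the helper take an HTML string means it can be tried from a plain Python shell without starting the server.

```python
# utils.py (hypothetical module living next to views.py)
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread


class MyHTMLParser(HTMLParser):
    """Same flag-based logic as the parser in the answer above."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.capture_data = False
        self.data_list = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.capture_data = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.capture_data = False

    def handle_data(self, data):
        if self.capture_data and data and not data.isspace():
            self.data_list.append(data)


def extract_cell_text(html):
    """Parse an HTML string and return the text of its <td> cells."""
    parser = MyHTMLParser()
    parser.feed(html)
    parser.close()
    return parser.data_list


# views.py would then hold only the view, roughly like this (Python 2):
#
#     import urllib
#     from django.http import HttpResponse
#     from utils import extract_cell_text
#
#     def parse_url(request):
#         content = urllib.urlopen('http://www.mysite.com/lists.asp').read()
#         return HttpResponse(str(extract_cell_text(content)))
```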
Check out BeautifulSoup; the documentation is at http://www.crummy.com/software/BeautifulSoup/documentation.html.
PS: It will be much more flexible, including for future requirements!