I am trying to use urllib to grab an HTML page, then use BeautifulSoup to extract data from it. I want to get all the numbers from comments_42.html, print their sum, and then display how many numbers there are. Here is my code. I tried using a regex, but it doesn't work for me.
import urllib
from bs4 import BeautifulSoup
url = 'http://python-data.dr-chuck.net/comments_42.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
tags = soup('span')
for tag in tags:
    print tag
Use BeautifulSoup's findAll() method to extract all span tags with class 'comments', since they contain the numbers you need. You can then perform any operation on them, depending on your requirements.
soup = BeautifulSoup(html,"html.parser")
data = soup.findAll("span", { "class":"comments" })
numbers = [d.text for d in data]
Here is the output:
[u'100', u'97', u'87', u'86', u'86', u'78', u'75', u'74', u'72', u'72', u'72', u'70', u'70', u'66', u'66', u'65', u'65', u'63', u'61', u'60', u'60', u'59', u'59', u'57', u'56', u'54', u'52', u'52', u'51', u'47', u'47', u'41', u'41', u'41', u'38', u'35', u'32', u'31', u'24', u'19', u'19', u'18', u'17', u'16', u'13', u'8', u'7', u'1', u'1', u'1']
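The question also asks for the sum and the count, and the list above holds strings rather than integers. A minimal sketch of that last step, assuming a `numbers` list like the one above (only the first five values are reproduced here to keep it short; the names `values`, `total` and `count` are illustrative):

```python
# Sample taken from the first five values of the output above;
# in practice, use the full `numbers` list built from the span tags.
numbers = [u'100', u'97', u'87', u'86', u'86']

# The scraped values are strings, so convert them before doing arithmetic.
values = [int(n) for n in numbers]

total = sum(values)   # sum of all the numbers
count = len(values)   # how many numbers were found

print(count)
print(total)
```

Written this way, the snippet should run unchanged on Python 2 and Python 3.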
I am taking the same Coursera course as you are. Instead of the solutions above, would you mind trying this one? I feel it stays within the scope of what we had learned up to this problem. It worked for me.
import urllib
import re
from bs4 import BeautifulSoup
url = 'http://python-data.dr-chuck.net/comments_216543.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
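# Use "total" rather than "sum" so the built-in sum() is not shadowed
total = 0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Convert the tag to a string and pull out every run of digits
    y = str(tag)
    x = re.findall("[0-9]+", y)
    for i in x:
        total = total + int(i)
print total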
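One caveat with running the regex over str(tag): it matches digits anywhere in the tag, attributes included. That is harmless as long as the span tags carry no digits in their attributes, but it is worth knowing. A small sketch of the difference, using a made-up tag string (the `id` attribute is an assumption for illustration, not something taken from the course page):

```python
import re

# Hypothetical tag string: the "id" attribute is invented for illustration
# and does not appear on the course page.
tag_text = '<span class="comments" id="c7">100</span>'

# Matching digits anywhere in the tag also picks up the attribute value.
all_digits = re.findall("[0-9]+", tag_text)

# Matching only digits that sit between ">" and "<" keeps just the content.
content_digits = re.findall(">([0-9]+)<", tag_text)

print(all_digits)
print(content_digits)
```

Reading the tag's text through BeautifulSoup, as in the findAll() answer above, avoids the issue as well.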