Extract numbers from a website using BeautifulSoup in Python

Question

I am trying to use urllib to fetch an HTML page and then use BeautifulSoup to extract data from it. I want to get all the numbers from comments_42.html, print their sum, and then display how many numbers there are. Here is my code. I also tried using a regex, but it doesn't work for me.

import urllib
from bs4 import BeautifulSoup
url = 'http://python-data.dr-chuck.net/comments_42.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
tags = soup('span')
for tag in tags:
    print tag

Learner · Accepted Answer

Use the findAll() method of BeautifulSoup (also available as find_all() in bs4) to extract all span tags with the class 'comments', since they contain the information you need. You can then perform any operation on them, depending on your requirements.

soup = BeautifulSoup(html,"html.parser")
data = soup.findAll("span", { "class":"comments" })
numbers = [d.text for d in data]

Here is the output:

[u'100', u'97', u'87', u'86', u'86', u'78', u'75', u'74', u'72', u'72', u'72', u'70', u'70', u'66', u'66', u'65', u'65', u'63', u'61', u'60', u'60', u'59', u'59', u'57', u'56', u'54', u'52', u'52', u'51', u'47', u'47', u'41', u'41', u'41', u'38', u'35', u'32', u'31', u'24', u'19', u'19', u'18', u'17', u'16', u'13', u'8', u'7', u'1', u'1', u'1']
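
To get from this list of strings to the sum and the count the question asks for, each string has to be converted to an integer first. Here is a minimal sketch in Python 3 syntax, assuming the page is structured like the one in the question. The inline sample_html string and the helper name extract_comment_counts are made up for illustration, so the snippet runs without a network connection:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the example runs offline.
sample_html = """
<table>
<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">87</span></td></tr>
<tr><td>Carol</td><td><span class="comments">86</span></td></tr>
</table>
"""

def extract_comment_counts(html):
    """Return the integer inside every <span class="comments"> element."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_="comments")
    # The span text is a string such as "97"; int() makes it summable.
    return [int(span.get_text().strip()) for span in spans]

numbers = extract_comment_counts(sample_html)
print("Count:", len(numbers))  # Count: 3
print("Sum:", sum(numbers))    # Sum: 270
```

With the real page, the same function could be given the html string that was read from the URL instead of sample_html.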

Tuhin · Answer

I am taking the same Coursera course as you are. Instead of the solution above, would you mind trying this one? I feel it stays within the scope of what we had learned up to this problem. It worked for me.

import urllib
import re
from bs4 import BeautifulSoup

url = 'http://python-data.dr-chuck.net/comments_216543.html'
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
total = 0  # running sum; named total to avoid shadowing the built-in sum()
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Pull every run of digits out of the tag's string form
    y = str(tag)
    x = re.findall("[0-9]+", y)
    for i in x:
        total = total + int(i)
print total
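
One caveat with running the regex over str(tag): it sees the whole tag, attributes included, so any digit inside an attribute value would be added to the total as well. Below is a sketch of the same loop in Python 3 syntax that searches only the text of each tag. The sample HTML (including its data-id attribute) and the function name sum_span_numbers are invented for illustration and are not taken from the course page:

```python
import re
from bs4 import BeautifulSoup

# Invented sample; the data-id attribute is there only to show the pitfall.
sample_html = (
    '<span class="comments" data-id="7">100</span>'
    '<span class="comments" data-id="8">42</span>'
)

def sum_span_numbers(html):
    """Add up every run of digits found in the text of each <span>."""
    soup = BeautifulSoup(html, "html.parser")
    total = 0
    for tag in soup("span"):
        # get_text() drops the markup, so attribute values are not searched.
        for match in re.findall(r"[0-9]+", tag.get_text()):
            total += int(match)
    return total

print(sum_span_numbers(sample_html))  # 142
```

Running the regex over str(tag) on this sample would also pick up the 7 and the 8 from data-id. For the real page, the html argument would come from urllib.request.urlopen(url).read() in Python 3 (urllib.urlopen in the Python 2 code above).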

Tags: python, regex, beautifulsoup

Saikorin

