Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract number from a website using beautifulsoup in Python

I am trying to use urllib to grab a html page, then use beautifulsoup to extract data out. I want to get all the number from comments_42.html and print out the sum of them, then display the numbers of data. Here is my code, I am trying to use regex, but it doesn't work for me.

import urllib
from bs4 import BeautifulSoup
url = 'http://python-data.dr-chuck.net/comments_42.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
tags = soup('span')
for tag in tags:
    print tag
like image 720
Saikorin Avatar asked Oct 28 '25 16:10

Saikorin


2 Answers

Use findAll() method of BeautifulSoup to extract all span tags with class 'comments', since they contain the information you need. You can then perform any operation on them depending on your requirements.

soup = BeautifulSoup(html,"html.parser")
data = soup.findAll("span", { "class":"comments" })
numbers = [d.text for d in data]

Here is the output:

[u'100', u'97', u'87', u'86', u'86', u'78', u'75', u'74', u'72', u'72',   u'72', u'70', u'70', u'66', u'66', u'65', u'65', u'63', u'61', u'60', u'60', u'59', u'59', u'57', u'56', u'54', u'52', u'52', u'51', u'47', u'47', u'41', u'41', u'41', u'38', u'35', u'32', u'31', u'24', u'19', u'19', u'18', u'17', u'16', u'13', u'8', u'7', u'1', u'1', u'1']
like image 166
Learner Avatar answered Oct 30 '25 06:10

Learner


I am taking the same course from Coursera as you are. Instead of going for the above solutions, do you mind trying this one. I feel this one is within the scope of what we had learnt till the above mentioned problem. It absolutely worked for me.

import urllib
import re
from bs4 import *

url = 'http://python-data.dr-chuck.net/comments_216543.html'
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html,"html.parser")
sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print sum
like image 25
Tuhin Avatar answered Oct 30 '25 05:10

Tuhin



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!