I've been trying to scrape the table from here but it seems to me that BeautifulSoup doesn't find any table.
I wrote:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv
url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r=requests.get(url)
data=r.text
soup=BeautifulSoup(data,'xml')
table=soup.find_all('table')
print table #prints nothing..
Based on other similar questions, I assume the HTML is broken in some way, but I'm not an expert. I couldn't find an answer in these: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage)
Thanks a bunch!
You are parsing HTML, but you used the xml parser.
You should use soup = BeautifulSoup(data, "html.parser")
The data you need lives inside a script tag; there is actually no table tag on the page, so you need to search the text inside that script.
N.B: If you are using Python 2.x then use "HTMLParser" instead of "html.parser".
Here is the code.
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)
list_to_write = []
list_to_write.append(["Rank", "School Name", "School Type", "Early Career Median Pay", "Mid-Career Median Pay", "% High Job Meaning", "% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    # The script holding the data is much larger than the others on the page.
    if len(text) > 10000:
        # Walk through the text, slicing out each quoted field in turn.
        while start > -1:
            start = text.find('"School Name":"', start)
            if start == -1:
                break
            start += len('"School Name":"')
            end = text.find('"', start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"', start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"', start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"', start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"', start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"', start)
            start += len('"Rank":"')
            end = text.find('"', start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"', start)
            start += len('"% High Job Meaning":"')
            end = text.find('"', start)
            high_job = text[start:end]

            start = text.find('"School Type":"', start)
            start += len('"School Type":"')
            end = text.find('"', start)
            school_type = text[start:end]

            start = text.find('"% STEM":"', start)
            start += len('"% STEM":"')
            end = text.find('"', start)
            stem = text[start:end]

            list_to_write.append([rank, school_name, school_type, early_pay, mid_pay, high_job, stem])

writer.writerows(list_to_write)
file_name.close()
This will write the table you need to a CSV file. Don't forget to close the file when you are done.
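Since the embedded records look like flat JSON objects (the "Key":"Value" pairs that the find() calls above target), an alternative sketch is to pull each record out with a regular expression and parse it with the json module instead of slicing strings manually. This assumes a structure I have not verified against the live page; the pattern, the table_json.csv name, and the pandas step are only illustrative.
import json
import re
import pandas as pd

records = []
for script in scripts:
    text = script.text
    if len(text) > 10000:
        # Assumed pattern: every flat {...} object that mentions "School Name".
        for block in re.findall(r'\{[^{}]*"School Name"[^{}]*\}', text):
            try:
                records.append(json.loads(block))
            except ValueError:
                pass  # skip fragments that are not valid JSON on their own

# Write all recovered records (whatever keys they carry) to CSV.
pd.DataFrame(records).to_csv("table_json.csv", index=False)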
While this won't find the table (it simply isn't in r.text), you are asking BeautifulSoup to use the xml parser instead of html.parser, so I would recommend changing that line to:
soup=BeautifulSoup(data,'html.parser')
One of the issues you will run into with web scraping is the difference between "client-rendered" and "server-rendered" websites. Basically, this means that the page you get from a plain HTTP request (through the requests module, or through curl, for example) is not the same content that would be rendered in a web browser. Some of the common frameworks for this are React and Angular. If you examine the source of the page you want to scrape, several of its HTML elements carry data-react-id attributes. A common tell for Angular pages is similar element attributes with the prefix ng, e.g. ng-if or ng-bind. You can see the page's source in Chrome or Firefox through their respective dev tools, which can be launched with the keyboard shortcut Ctrl+Shift+I in either browser. It's worth noting that not all React and Angular pages are only client-rendered.
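If you want to check for these markers from Python rather than the browser dev tools, a rough sketch (using only the attribute names mentioned above) is to count elements carrying them in the raw response:
# Rough check for client-side rendering markers in the raw HTML.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.payscale.com/college-salary-report/bachelors?page=65").text
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all(attrs={"data-react-id": True})))  # React tell
print(len(soup.find_all(attrs={"ng-if": True})))          # Angular tells
print(len(soup.find_all(attrs={"ng-bind": True})))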
In order to get this sort of content, you would need to use a headless browser tool like Selenium. There are many resources on web scraping with Selenium and Python.
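For example, a minimal sketch with Selenium and headless Chrome (assuming Chrome and chromedriver are installed; the headless flag and driver setup may differ with your browser and Selenium version) could look like this:
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("http://www.payscale.com/college-salary-report/bachelors?page=65")
# Hand the browser-rendered HTML to BeautifulSoup once the page has loaded.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(len(soup.find_all("table")))  # any tables present after client-side rendering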