
How to scrape page with BeautifulSoup? Page Source not matching Inspect Element

I'm trying to scrape a few things from this fantasy basketball page. I'm using BeautifulSoup in Python 3.5+ to do this.

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')

To begin with, I'd like to scrape the titles of the 9 categories into a Python list. So my list should look like categories = ['FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS'].

What I hoped to do is something like the following:

tableSubHead = soup.find_all('tr', class_='Table2__header-row')
tableSubHead = tableSubHead[0]
listCats = tableSubHead.find_all('th')
categories = []
for cat in listCats:
  if 'title' in cat.attrs:
    categories.append(cat.string)

However, soup.find_all('tr', class_='Table2__header-row') returns an empty list instead of the table row element I want. I suspect this is because the page source is completely different from what Inspect Element shows in Chrome Dev Tools. I understand this happens because JavaScript changes the elements on the page dynamically, but I'm not sure what the solution would be.
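
One way to confirm that suspicion is to count how many elements match the selector in the HTML that requests actually received. The sketch below is only an illustration: count_matches is a made-up helper, and the two HTML strings are invented stand-ins for the raw page source and the rendered DOM, not real ESPN markup.

```python
from bs4 import BeautifulSoup


def count_matches(html, selector):
    """Return how many elements in html match the CSS selector."""
    return len(BeautifulSoup(html, 'html.parser').select(selector))


# Hypothetical stand-in for what requests.get(...).text returns:
# an empty mount point plus a script tag, no table yet.
static_html = '<div id="app"></div><script src="bundle.js"></script>'

# Hypothetical stand-in for the DOM after JavaScript has run.
rendered_html = (
    '<div id="app"><table><tr class="Table2__header-row">'
    '<th>FG%</th></tr></table></div>'
)

print(count_matches(static_html, '.Table2__header-row'))    # 0
print(count_matches(rendered_html, '.Table2__header-row'))  # 1
```

If the count is 0 for source_code.text but the element is visible in Dev Tools, the row is being added by JavaScript after the initial response.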

Warren Crasta asked Oct 26 '25 08:10



1 Answer

The problem is that this site is a web app: the table you see is generated by JavaScript after the page loads, and requests only downloads the initial HTML without running any JavaScript. Here is how I got the result with Selenium, which opens a headless browser, lets the JavaScript run, and waits for a fixed period before reading the page:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# requests.get(...) alone returns only the unrendered HTML, so drive a real browser instead.

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Selenium 4 style: don't block on the page load event; we wait manually below.
options.page_load_strategy = 'none'
driver = webdriver.Chrome(options=options)
driver.set_window_size(1440, 900)
driver.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
time.sleep(15)  # give the JavaScript time to render the tables

plain_text = driver.page_source
driver.quit()  # close the headless browser once the HTML is captured
soup = BeautifulSoup(plain_text, 'lxml')

soup.select('.Table2__header-row') # Returns full results.

len(soup.select('.Table2__header-row')) # 8

This approach lets you scrape sites that are built as web apps and greatly expands what you can do: you can even execute commands such as scrolling or clicking to load more content on the fly.
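
A fixed sleep can be too long or too short, so one possible refinement is Selenium's explicit wait, which returns as soon as the header row exists. The sketch below is not the code I ran: fetch_rendered_html and parse_categories are hypothetical helper names, the sample markup is invented, and it follows the question's assumption that the category cells are the th elements carrying a title attribute.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the HTML once css_selector appears."""
    # Imported lazily so parse_categories() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Returns as soon as the element exists; raises TimeoutException
        # if it has not appeared after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_categories(html):
    """Return the text of each <th> with a title attribute in the first header row."""
    soup = BeautifulSoup(html, 'html.parser')
    header_row = soup.select_one('tr.Table2__header-row')
    if header_row is None:
        return []
    return [th.get_text(strip=True) for th in header_row.find_all('th', title=True)]


# Hypothetical markup, only to exercise parse_categories() offline.
sample = (
    '<table><tr class="Table2__header-row">'
    '<th>RK</th><th title="Field Goal Percentage">FG%</th>'
    '<th title="Points">PTS</th></tr></table>'
)
print(parse_categories(sample))  # ['FG%', 'PTS']
```

Against the real page this would be called as parse_categories(fetch_rendered_html(url, '.Table2__header-row')). Which header row actually holds the nine categories still has to be checked against the rendered markup.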

Use pip install selenium to install Selenium. It also supports Firefox if you prefer that browser.
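
Switching browsers mostly means swapping the options and driver classes. A minimal sketch, assuming Selenium 4 and a locally installed Chrome or Firefox; make_headless_driver is a hypothetical helper, not part of Selenium:

```python
def make_headless_driver(browser='chrome'):
    """Return a headless Selenium driver for 'chrome' or 'firefox'."""
    browser = browser.lower()
    if browser not in ('chrome', 'firefox'):
        raise ValueError('unsupported browser: {!r}'.format(browser))

    # Imported after validation so a bad name fails before Selenium is touched.
    from selenium import webdriver

    if browser == 'firefox':
        options = webdriver.FirefoxOptions()
        options.add_argument('-headless')
        return webdriver.Firefox(options=options)

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    return webdriver.Chrome(options=options)
```

The rest of the scraping code stays the same, since both drivers expose get(), page_source and quit().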

Rocky Li answered Oct 28 '25 23:10
