Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scraping a dynamically/Javascript generated website with Python/Selenium

I'm trying to scrape this website:

http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210

using Python and Selenium (see code below). The content is dynamically generated, and apparently data which is not visible in the browser is not loaded. I have tried making the browser window larger, and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still plenty of data to scrape in the vertical direction. The scrolling appears not to work at all.

Does anyone have any bright ideas about how to do this?

Thanks!

from selenium import webdriver
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

time.sleep(5) # wait to load

soup = BeautifulSoup(driver.page_source)

table = soup.find("table", {"id":"DataTable"})

### get data
thead = table.find('tbody')
loopRows = thead.findAll('tr')
rows = []
for row in loopRows:
rows.append([val.text.encode('ascii', 'ignore') for val in  row.findAll(re.compile('td|th'))])
with open("body.csv", 'wb') as test_file:
  file_writer = csv.writer(test_file)
  for row in rows:
      file_writer.writerow(row)
like image 673
Karl Nordgren Avatar asked Jan 31 '26 20:01

Karl Nordgren


1 Answers

This will get you as far as autosaving the entire csv to disk, but I haven't found a robust way to determine when the download is complete:

import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "use the last folder specified for a download"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

# driver = webdriver.Firefox(firefox_profile=fp)
with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')

    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, 'rb') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)

Explanation:

Point your browser to url. You'll see there is an Actions menu with an option to Download report data... and a suboption entitled "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words you'll find

"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"

So it follows naturally that you might try getting webdriver to execute the JavaScript function call onDownload(2). We can do that with

driver.execute_script("onDownload(2);")

but normally another window will then pop up asking if you want save the file. To automate the saving-to-disk, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line:

fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

The curl method described in the FAQ does not work here since we do not have a url for the csv file. However, this page describes another way to find the MIME type: Use a Firefox browser to open the save dialog. Check the checkbox saying "Do this automatically for files like this". Then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description:

  <RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
                   NC:alwaysAsk="false"
                   NC:saveToDisk="true">
    <NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
  </RDF:Description>

This tells us the mime type is "application/x-csv". Bingo, we are in business.

like image 185
unutbu Avatar answered Feb 03 '26 10:02

unutbu



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!