I'm using Nokogiri to parse TechCrunch with a specific search term:
http://techcrunch.com/search/education#stq=education&stp=1
The problem is that the site takes a few seconds before it returns the list of results for the search term, so the page is empty of relevant content at the moment Nokogiri retrieves it.
The content appears to load dynamically a couple of seconds later - I'm guessing via JavaScript. Any ideas how to retrieve the HTML with a slight delay?
You can use the Ruby method sleep:

```ruby
seconds_to_delay = 5
sleep seconds_to_delay
```
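Note that sleep by itself only pauses your script; it helps only if the delay is on the server side. In that case you can wrap the fetch in a retry loop rather than guessing a single delay. A minimal sketch, where `fetch` is a hypothetical stand-in for whatever callable does the actual download (e.g. wrapping URI.open and Nokogiri):

```ruby
# Retry a fetch until the returned HTML matches the pattern we expect,
# sleeping between attempts. Returns the HTML, or nil if we gave up.
def fetch_until_match(fetch, pattern, attempts: 5, delay: 2)
  attempts.times do
    html = fetch.call
    return html if html =~ pattern
    sleep delay
  end
  nil
end
```

This cannot make JavaScript-rendered content appear, though, since Nokogiri never executes JavaScript - that is the scenario Edit 1 below deals with.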
Edit 1: Dealing with divs that load some time after the document finishes loading
I hate this scenario. I had to deal with the exact same one, so here's how I solved it. You need to use something like the selenium-webdriver gem.
```ruby
require 'selenium-webdriver'

url = "http://techcrunch.com/search/education#stq=education&stp=1"
css_selector = ".tab-panel.active"

driver = Selenium::WebDriver.for :firefox
driver.get(url)
driver.switch_to.default_content

# By the time the browser has loaded the page, its JavaScript has run,
# so the element actually contains the search results.
posts_text = driver.find_element(:css, css_selector).text
puts posts_text

driver.quit
```
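If the element still appears a moment after page load, a fixed sleep is fragile; selenium-webdriver also ships explicit waits (Selenium::WebDriver::Wait) that poll for a condition instead. The underlying pattern, sketched in plain Ruby (the timeout and interval values are arbitrary):

```ruby
# Poll a block until it returns a truthy value or the timeout elapses -
# the same idea Selenium::WebDriver::Wait#until implements.
def wait_until(timeout: 10, interval: 0.5)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result
    raise "timed out waiting for condition" if Time.now > deadline
    sleep interval
  end
end
```

With Selenium itself you would write something like `wait = Selenium::WebDriver::Wait.new(timeout: 10)` and then `wait.until { driver.find_element(:css, css_selector) }`.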
If you are running this on a virtual machine (Heroku, AWS EC2, DigitalOcean and the like), you can't use Firefox. Instead you need a headless browser like PhantomJS.
To use PhantomJS instead of Firefox, first install phantomjs on the VM, then change the driver line to driver = Selenium::WebDriver.for :phantomjs.
You can also use a gem that installs phantomjs for you.
Edit 2: Answering question b)
```ruby
require 'selenium-webdriver'

url = "http://techcrunch.com/search/education#stq=education&stp=1"
css_selector = ".tab-panel.active ul.river-compact.river-search li"

driver = Selenium::WebDriver.for :phantomjs
driver.get(url)
driver.switch_to.default_content

# Grab every matching list item and print its visible text -
# printing the element object itself would only show its inspect output.
items = driver.find_elements(:css, css_selector)
items.each { |x| puts x.text }

driver.quit
```
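Usually you want more than the text from each result row, e.g. the headline and its URL. The mapping step can be sketched like this; the `attribute("href")` call mirrors Selenium's element API, but the elements here are stubbed with a Struct so the sketch runs standalone, and the field names are hypothetical:

```ruby
# Stand-in for a Selenium element: responds to #text and #attribute
# the way Selenium::WebDriver::Element does.
FakeLink = Struct.new(:text, :href) do
  def attribute(name)
    name == "href" ? href : nil
  end
end

# Turn each search-result element into a { title:, url: } hash.
def summarize(items)
  items.map { |link| { title: link.text, url: link.attribute("href") } }
end
```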