I'm using Nokogiri to parse TechCrunch with a specific search term:
http://techcrunch.com/search/education#stq=education&stp=1
The problem is that the site takes a few seconds before it returns the list of results for the search term, so the page is empty of relevant content at the moment Nokogiri retrieves it.
The content appears to load dynamically a couple of seconds later - I'm guessing via JavaScript. Any ideas how to retrieve the HTML with a slight delay?
You can use the Ruby method sleep:

```ruby
seconds_to_delay = 5
sleep seconds_to_delay
```
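Note that sleep by itself only pauses your script; it helps only if the delay is on the server side. In that case you can wrap the fetch in a retry loop rather than guessing a single delay. A minimal sketch, where `fetch` is a hypothetical stand-in for whatever callable does the actual download (e.g. wrapping URI.open and Nokogiri):

```ruby
# Retry a fetch until the returned HTML matches the pattern we expect,
# sleeping between attempts. Returns the HTML, or nil if we gave up.
def fetch_until_match(fetch, pattern, attempts: 5, delay: 2)
  attempts.times do
    html = fetch.call
    return html if html =~ pattern
    sleep delay
  end
  nil
end
```

This cannot make JavaScript-rendered content appear, though, since Nokogiri never executes JavaScript - that is the scenario Edit 1 below deals with.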
Edit 1: Dealing with divs that load some time after the document finishes loading
I hate this scenario. I had to deal with the exact same one, so here's how I solved it. You need to use something like the selenium-webdriver gem.
```ruby
require 'selenium-webdriver'

url = "http://techcrunch.com/search/education#stq=education&stp=1"
css_selector = ".tab-panel.active"

driver = Selenium::WebDriver.for :firefox
driver.get(url)
driver.switch_to.default_content

# By the time the browser has loaded the page, its JavaScript has run,
# so the element actually contains the search results.
posts_text = driver.find_element(:css, css_selector).text
puts posts_text

driver.quit
```
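If the element still appears a moment after page load, a fixed sleep is fragile; selenium-webdriver also ships explicit waits (Selenium::WebDriver::Wait) that poll for a condition instead. The underlying pattern, sketched in plain Ruby (the timeout and interval values are arbitrary):

```ruby
# Poll a block until it returns a truthy value or the timeout elapses -
# the same idea Selenium::WebDriver::Wait#until implements.
def wait_until(timeout: 10, interval: 0.5)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result
    raise "timed out waiting for condition" if Time.now > deadline
    sleep interval
  end
end
```

With Selenium itself you would write something like `wait = Selenium::WebDriver::Wait.new(timeout: 10)` and then `wait.until { driver.find_element(:css, css_selector) }`.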
If you are running this on a virtual machine (Heroku, AWS EC2, DigitalOcean and the like), you can't use Firefox. Instead you need a headless browser like PhantomJS.
To use PhantomJS instead of Firefox, first install phantomjs on the VM, then change the driver line to driver = Selenium::WebDriver.for :phantomjs.
You can also use a gem that installs phantomjs for you.
Edit 2: Answering question b)
```ruby
require 'selenium-webdriver'

url = "http://techcrunch.com/search/education#stq=education&stp=1"
css_selector = ".tab-panel.active ul.river-compact.river-search li"

driver = Selenium::WebDriver.for :phantomjs
driver.get(url)
driver.switch_to.default_content

# Grab every matching list item and print its visible text -
# printing the element object itself would only show its inspect output.
items = driver.find_elements(:css, css_selector)
items.each { |x| puts x.text }

driver.quit
```
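Usually you want more than the text from each result row, e.g. the headline and its URL. The mapping step can be sketched like this; the `attribute("href")` call mirrors Selenium's element API, but the elements here are stubbed with a Struct so the sketch runs standalone, and the field names are hypothetical:

```ruby
# Stand-in for a Selenium element: responds to #text and #attribute
# the way Selenium::WebDriver::Element does.
FakeLink = Struct.new(:text, :href) do
  def attribute(name)
    name == "href" ? href : nil
  end
end

# Turn each search-result element into a { title:, url: } hash.
def summarize(items)
  items.map { |link| { title: link.text, url: link.attribute("href") } }
end
```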