I'm trying to scrape some dynamic content using Scrapy. I have successfully set up Splash to work alongside it. However, the selectors in the following spider yield empty results:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title').extract(),
            'link': response.url,
            'productID': Selector(text=response.body).xpath('//span[@itemprop="productID"]/text()').extract(),
            'model': Selector(text=response.body).xpath('//span[@itemprop="model"]/text()').extract(),
            'price': Selector(text=response.body).css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract(),
        }
The selectors work just fine in the Scrapy shell, so I'm very confused about what isn't working.
The only difference I can find between the two situations is how the encoding of response.body is handled: it comes out as gibberish if I try to print or decode it from within the parse function.
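For example, this is roughly how I'm comparing the raw bytes with Scrapy's decoded text (just a debugging sketch inside parse):

def parse(self, response):
    # response.body is raw bytes; printing it directly is what gives the gibberish
    print(repr(response.body[:200]))
    # response.text should be the body decoded with the encoding Scrapy detected
    print(response.encoding)
    print(response.text[:200])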
Any hint or reference would be greatly appreciated.
Your spider works fine for me, with Scrapy 1.1, Splash 2.1 and no modification of the code in your question, just using the settings suggested in https://github.com/scrapy-plugins/scrapy-splash
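For reference, this is the relevant part of settings.py, copied from the scrapy-splash README (adjust SPLASH_URL to wherever your Splash instance is listening):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'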
As others have mentioned, your parse function can be simplified by using response.css() and response.xpath() directly, without needing to rebuild a Selector from the response body.
I tried with:
import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title/text()').extract_first(),
            'link': response.url,
            'productID': response.xpath('//span[@itemprop="productID"]/text()').extract_first(),
            'model': response.xpath('//span[@itemprop="model"]/text()').extract_first(),
            'price': response.css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract_first(),
        }
and got this:
$ scrapy crawl cartier
2016-06-08 17:16:08 [scrapy] INFO: Scrapy 1.1.0 started (bot: stack37701774)
2016-06-08 17:16:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack37701774.spiders', 'SPIDER_MODULES': ['stack37701774.spiders'], 'BOT_NAME': 'stack37701774'}
(...)
2016-06-08 17:16:08 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-08 17:16:08 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-08 17:16:08 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-08 17:16:08 [scrapy] INFO: Spider opened
2016-06-08 17:16:08 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-08 17:16:08 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-08 17:16:11 [scrapy] DEBUG: Crawled (200) <GET http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html via http://localhost:8050/render.html> (referer: None)
2016-06-08 17:16:11 [scrapy] DEBUG: Scraped from <200 http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html>
{'model': u'Ballon Bleu de Cartier watch', 'productID': u'W69017Z4', 'link': 'http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html', 'price': None, 'title': u'CRW69017Z4 - Ballon Bleu de Cartier watch - 36 mm, steel, leather - Cartier'}
2016-06-08 17:16:11 [scrapy] INFO: Closing spider (finished)
2016-06-08 17:16:11 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 618,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 213006,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 8, 15, 16, 11, 201281),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2016, 6, 8, 15, 16, 8, 545105)}
2016-06-08 17:16:11 [scrapy] INFO: Spider closed (finished)
I tried that SplashRequest approach and ran into the same problem you did. After some experimenting, I had better luck executing a Lua script instead.
script = """
function main(splash)
local url = splash.args.url
assert(splash:go(url))
assert(splash:wait(0.5))
return {
html = splash:html(),
png = splash:png(),
har = splash:har(),
}
end
"""
Then make the request using the script as an argument. You can experiment with the script, and test it in the Splash web UI at localhost:8050 (the default) or whichever port your Splash instance uses.
yield SplashRequest(
    url,
    self.parse,
    args={'lua_source': self.script},
    endpoint='execute',
)
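Putting it together, the spider can carry the script as a class attribute, roughly like this (a sketch based on the spider from the question; with scrapy-splash's default magic_response handling, the html key returned by the script becomes the response body, so the usual selectors keep working):

import scrapy
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    # Lua source sent to Splash's /execute endpoint
    script = """
    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(0.5))
        return {html = splash:html()}
    end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'lua_source': self.script},
                endpoint='execute',
            )

    def parse(self, response):
        # the html returned by the script is the response body here
        yield {'title': response.xpath('//title/text()').extract_first()}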
Oh, and by the way: yielding plain dicts like that is a bit unusual; consider defining Scrapy Items instead, as in the sketch below.
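A minimal sketch (WatchItem is just a name I made up; the fields mirror the dict from the question):

import scrapy


class WatchItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    productID = scrapy.Field()
    model = scrapy.Field()
    price = scrapy.Field()

Then in the spider, yield the item instead of a dict:

def parse(self, response):
    yield WatchItem(
        title=response.xpath('//title/text()').extract_first(),
        link=response.url,
        productID=response.xpath('//span[@itemprop="productID"]/text()').extract_first(),
        model=response.xpath('//span[@itemprop="model"]/text()').extract_first(),
        price=response.css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract_first(),
    )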