I couldn't configure Scrapy to crawl with depth > 1. I have tried the following three options; none of them worked, and request_depth_max in the summary log is always 1:
1) Adding:
from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2
to the spider file (the example spider from the docs, just pointed at a different site)
2) Running the crawl command with the -s option:
/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org
3) Adding to settings.py and scrapy.cfg:
DEPTH_LIMIT=2
How should it be configured to crawl to a depth greater than 1?
warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
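In other words, the limit you set was most likely taking effect; request_depth_max in the closing stats just reports the deepest request that was actually scheduled, which can be lower than the configured DEPTH_LIMIT. For reference, a minimal sketch of how the setting normally lives in a project's settings.py (values here are only illustrative):

# settings.py -- project settings module (illustrative values)
BOT_NAME = 'scrapybot'

# 0 (the default) imposes no limit on crawl depth;
# a positive value caps how deep Scrapy follows links.
DEPTH_LIMIT = 2

# Note: request_depth_max in the crawl stats is the deepest depth
# actually reached, which stays at 1 if no page links any deeper.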
So let's scrape mininova and see what happens. Starting at the today page, we see that there are two tor links:
stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrète', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]
Let's fetch the first link. We see there are no new tor links on that page, just a link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):
>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
No luck there; we are still at depth 1. Let's try the other link:
>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]
Nope, this page also contains only one link, a link to itself, which likewise gets filtered. Since there are no new links to scrape, Scrapy closes the spider (at depth == 1).
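You can see the same thing from inside a spider by logging the depth that DepthMiddleware records for each response. This is only a sketch, loosely based on the mininova example from the docs; item extraction is omitted, and the depth-logging line is my addition:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log

class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        # DepthMiddleware stores each response's depth in meta; on this
        # crawl it never exceeds 1 because no /tor/ page links to another
        log.msg('depth=%s url=%s' % (response.meta.get('depth', 0), response.url))

Running it with -s DEPTH_LIMIT=2 (as in the question) should still log depth=1 for every /tor/ page, which matches the request_depth_max reported in the stats.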
I had a similar issue; it helped to set follow=True when defining the Rule:
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
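So when a Rule has a callback, you have to ask for link following explicitly. A minimal sketch using the same era-style imports as the shell session above (item extraction again omitted):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [
        # callback is set, so follow would default to False;
        # follow=True keeps extracting /tor/ links from matched pages too
        Rule(SgmlLinkExtractor(allow=['/tor/\d+']),
             callback='parse_torrent',
             follow=True),
    ]

    def parse_torrent(self, response):
        # ... extract and return the item fields here ...
        pass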