I am passing in a start_url that is a page of news articles (e.g. cnn.com), but I only want to extract the articles themselves; I don't want to follow any links on the article pages. To do that, I'm using a CrawlSpider with the following rule:
rules = (
    Rule(LinkExtractor(allow=('regexToMatchArticleUrls',),
                       deny=('someDenyUrls',)),
         callback='parse_article_page'),
)

def parse_article_page(self, response):
    # extracts the title, date, body, etc. of the article
I've enabled scrapy.spidermiddlewares.depth.DepthMiddleware and set DEPTH_LIMIT = 1.

However, links found on the individual article pages that happen to match regexToMatchArticleUrls are still being crawled, since they point to other parts of the same website (and I cannot make the regex more restrictive). Why are these links crawled at all when DEPTH_LIMIT = 1? Is it because the depth resets for each link extracted by the LinkExtractor, i.e. the article page URLs? Is there a way to either make DEPTH_LIMIT work or extend DepthMiddleware so that links on the article pages are not crawled? Thanks!
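In case it helps, here is a minimal sketch of the setup described above (the spider name, start URL, and yielded fields are placeholders, not my real values; the allow/deny patterns are the same placeholders as in the rule):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = 'articles'
    start_urls = ['https://www.cnn.com/']

    custom_settings = {
        'DEPTH_LIMIT': 1,  # intent: only follow links found on the start page
    }

    rules = (
        Rule(LinkExtractor(allow=('regexToMatchArticleUrls',),
                           deny=('someDenyUrls',)),
             callback='parse_article_page'),
    )

    def parse_article_page(self, response):
        # extract the title, date, body, etc. of the article (placeholder fields)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }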
For the DepthMiddleware to work correctly, the depth meta key needs to be passed from one request to the next; otherwise, depth will be reset to 0 after each new request. Unfortunately, by default, the CrawlSpider doesn't carry this meta attribute over from one request to the next.

This can be solved with a spider middleware (middlewares.py):
from scrapy import Request


class StickyDepthSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # depth of the response these requests were extracted from (if any)
        key_found = response.meta.get('depth', None)
        for x in result:
            if isinstance(x, Request) and key_found is not None:
                # carry the parent depth over onto the outgoing request
                x.meta.setdefault('depth', key_found)
            yield x
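Note that setdefault only fills in depth when the outgoing request doesn't already carry one, so the middleware never overwrites a depth value that has already been assigned elsewhere.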
Also, don't forget to enable this middleware in your settings.py:

SPIDER_MIDDLEWARES = {
    '{your_project_name}.middlewares.StickyDepthSpiderMiddleware': 100,
}
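With the middleware registered, DEPTH_LIMIT is applied against a depth that actually accumulates across requests. As an optional sanity check (not required for the fix), turning on verbose depth stats in settings.py makes the depth distribution visible in the stats dump at the end of the crawl:

DEPTH_LIMIT = 1
DEPTH_STATS_VERBOSE = True  # adds request_depth_count/<n> entries to the crawl stats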