I am passing in a start_url that is a page of news articles (e.g. cnn.com), but I only want to extract the articles themselves; I don't want to follow any links on the article pages. To do that, I'm using a CrawlSpider with the following rule:
rules = (
    Rule(LinkExtractor(allow=('regexToMatchArticleUrls',),
                       deny=('someDenyUrls',)),
         callback='parse_article_page'),
)

def parse_article_page(self, response):
    # extracts the title, date, body, etc. of the article
I've enabled scrapy.spidermiddlewares.depth.DepthMiddleware and set DEPTH_LIMIT = 1.

However, links found on the individual article pages that happen to match regexToMatchArticleUrls are still being crawled, since they point to other parts of the same website (and I cannot make the regex more restrictive). Why are these links crawled at all when DEPTH_LIMIT = 1? Is it because the depth resets for each link extracted by the LinkExtractor, i.e. the article page URLs? Is there a way to either make DEPTH_LIMIT work or extend DepthMiddleware so that links on the article pages are not crawled? Thanks!
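In case it helps, here is a minimal sketch of the setup described above (the spider name, start URL, and yielded fields are placeholders, not my real values; the allow/deny patterns are the same placeholders as in the rule):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = 'articles'
    start_urls = ['https://www.cnn.com/']

    custom_settings = {
        'DEPTH_LIMIT': 1,  # intent: only follow links found on the start page
    }

    rules = (
        Rule(LinkExtractor(allow=('regexToMatchArticleUrls',),
                           deny=('someDenyUrls',)),
             callback='parse_article_page'),
    )

    def parse_article_page(self, response):
        # extract the title, date, body, etc. of the article (placeholder fields)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }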
For the DepthMiddleware to work correctly, the depth meta key needs to be passed from one request to the next; otherwise, depth will be reset to 0 after each new request. Unfortunately, by default, the CrawlSpider doesn't carry this meta attribute over from one request to the next.

This can be solved with a spider middleware (middlewares.py):
from scrapy import Request


class StickyDepthSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # depth of the response these requests were extracted from (if any)
        key_found = response.meta.get('depth', None)
        for x in result:
            if isinstance(x, Request) and key_found is not None:
                # carry the parent depth over onto the outgoing request
                x.meta.setdefault('depth', key_found)
            yield x
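Note that setdefault only fills in depth when the outgoing request doesn't already carry one, so the middleware never overwrites a depth value that has already been assigned elsewhere.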
Also, don't forget to enable this middleware in your settings.py:

SPIDER_MIDDLEWARES = {
    '{your_project_name}.middlewares.StickyDepthSpiderMiddleware': 100,
}
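With the middleware registered, DEPTH_LIMIT is applied against a depth that actually accumulates across requests. As an optional sanity check (not required for the fix), turning on verbose depth stats in settings.py makes the depth distribution visible in the stats dump at the end of the crawl:

DEPTH_LIMIT = 1
DEPTH_STATS_VERBOSE = True  # adds request_depth_count/<n> entries to the crawl stats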