
Scrapy - Get spider variables inside DOWNLOAD MIDDLEWARE __init__

Tags: python, scrapy

I'm working on a Scrapy project in which I wrote a downloader middleware to avoid making requests to URLs that are already in the database.

DOWNLOADER_MIDDLEWARES = {
   'imobotS.utilities.RandomUserAgentMiddleware': 400,
   'imobotS.utilities.DuplicateFilterMiddleware': 500,
   'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

The idea is to connect in __init__ and load a distinct list of all the URLs currently stored in the database, then raise IgnoreRequest if a requested URL is already there.

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        # pymongo.Connection was removed in recent pymongo versions;
        # MongoClient is the current equivalent.
        connection = pymongo.MongoClient('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None

Since I want to restrict the URL list loaded in __init__ by WEBSITE_NAME, is there a way to identify the current spider's name inside the downloader middleware's __init__ method?

Asked Oct 27 '25 by André Teixeira

1 Answer

Building on what @Ahsan Roy said above, you don't have to use the signals API (at least in Scrapy 2.4.0):

In the from_crawler method you have access to the spider (with its name) as well as all the crawler settings. You can use this to pass any arguments you want into the constructor of your middleware class (i.e. __init__):

class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
Answered Oct 29 '25 by Lorenz


