
Scrapy - Get spider variables inside DOWNLOAD MIDDLEWARE __init__

Tags: python, scrapy

I'm working on a Scrapy project in which I wrote a downloader middleware to avoid making requests to URLs that are already in the database.

DOWNLOADER_MIDDLEWARES = {
   'imobotS.utilities.RandomUserAgentMiddleware': 400,
   'imobotS.utilities.DuplicateFilterMiddleware': 500,
   'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

The idea is to connect in __init__ and load a distinct list of all the URLs currently stored in the database, then raise IgnoreRequest if a requested URL is already there.

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        # pymongo.Connection was removed in recent pymongo versions;
        # MongoClient is the current equivalent.
        connection = pymongo.MongoClient('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None

Since I want to restrict the URL list loaded in __init__ by WEBSITE_NAME, is there a way to identify the current spider's name inside the downloader middleware's __init__ method?

Asked Oct 27 '25 by André Teixeira

1 Answer

Building on what @Ahsan Roy said above, you don't have to use the signals API (at least in Scrapy 2.4.0):

In the from_crawler method you have access to the spider (with its name) as well as all the crawler settings. You can use this to pass any arguments you want into the constructor of your middleware class (i.e. __init__):

class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
Answered Oct 29 '25 by Lorenz


