I'm working on a Scrapy project in which I wrote a downloader middleware to avoid making requests to URLs that are already in the database.
DOWNLOADER_MIDDLEWARES = {
    'imobotS.utilities.RandomUserAgentMiddleware': 400,
    'imobotS.utilities.DuplicateFilterMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
The idea is to connect to the database in __init__, load a distinct list of all the URLs currently stored there, and raise IgnoreRequest if the requested URL is already in the database.
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):
    def __init__(self):
        # Load every URL already stored for this site so duplicate requests can be skipped
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')
        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None
So, since I want to restrict the URL set loaded in __init__ to the current WEBSITE_NAME, is there a way to identify the current spider's name inside the downloader middleware's __init__ method?
Building on what @Ahsan Roy said above, you don't have to use the signals API (at least in Scrapy 2.4.0):
In the from_crawler class method you have access to the spider (and therefore its name) as well as the crawler settings. You can use these to pass any arguments you want into the constructor of your middleware class (i.e. __init__):
class DuplicateFilterMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
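Applied to the question's use case, a minimal sketch could look like the one below (untested). It assumes the spider's name matches the 'site' value stored in MongoDB, and it uses pymongo.MongoClient with the username/password keyword arguments (PyMongo 3.6+); the host, port, database, and credentials are placeholders taken from the question:
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # The spider is already attached to the crawler at this point,
        # so its name can be used to restrict the URLs loaded in __init__.
        return cls(crawler.spider)

    def __init__(self, spider):
        client = pymongo.MongoClient('localhost', 12345,
                                     username='scott', password='*****')
        db = client['my_db']
        # Load only the URLs stored for this spider's site, kept in a set
        # for fast membership checks in process_request.
        self.url_set = set(db.ad.find({'site': spider.name}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None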