I have just started using Scrapy and I would like a way to persist the URLs that have previously been crawled, so I can run subsequent scrapes and only take in new data from unknown URLs. I am seeing a couple of different ways to filter out duplicates and a couple of ways to persist data, and I would like to know the recommended way of doing these things in Scrapy 0.24. Here are the options as I see them:
For duplicate filtering
There is the DUPEFILTER_CLASS setting in settings.py, which is still referenced in the documentation. I have also seen the documentation suggest putting a duplicates filter in an Item Pipeline, as shown here: http://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=duplicates#duplicates-filter
Are people using DUPEFILTER_CLASS or putting a dupefilter in an Item Pipeline?
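For reference, the duplicates filter from that pipeline documentation looks roughly like this (it assumes items carry an 'id' field, and the seen set only lives for a single run):

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()  # in-memory only, reset on every run

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item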
For persistent duplicate tracking
I have tried using scrapy-redis to persist the URLs that have previously been scraped so they can be used by a duplicate filter, but it seems that any DUPEFILTER_CLASS I set is ignored. I also see that there is spider.state, which will store a dictionary if you run with the JOBDIR option, but that doesn't seem like a great fit for duplicate filtering.
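For context, using spider.state for this would look roughly like the following sketch (the spider name and callback are just illustrative; the state dict is only saved and restored between runs when JOBDIR is set, and its contents need to be picklable):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        seen = self.state.setdefault('seen_urls', set())
        if response.url in seen:
            return  # already handled in a previous batch, skip it
        seen.add(response.url)
        # ... extract items / follow links as usual ...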
Can anyone point me to a code sample that can be used to persist data between batches and do duplicate filtering?
I don't have a code snippet nor enough rep to comment on your question, but here is a suggestion for the persistent URL dupe-filtering idea.
Keep a database of crawled URLs.
Implement a downloader middleware that does the following (Python-ish pseudocode):
if url is not in the database:
    add url to the database
    return None  # this tells Scrapy to keep handling the request as normal
else:
    raise IgnoreRequest
HTH
edit: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
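To make this concrete, here is a minimal sketch of such a middleware, with an in-memory set standing in for the database (the class and module names are placeholders, not anything Scrapy ships with):

from scrapy.exceptions import IgnoreRequest

class SeenUrlsMiddleware(object):
    def __init__(self):
        # stand-in for a real database of previously crawled URLs
        self.seen_urls = set()

    def process_request(self, request, spider):
        if request.url not in self.seen_urls:
            self.seen_urls.add(request.url)
            return None  # tells Scrapy to keep handling the request as normal
        raise IgnoreRequest()  # already crawled, drop it

You would then enable it in settings.py (the path assumes the class lives in myproject/middlewares.py):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeenUrlsMiddleware': 543,
}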
If you are using a CrawlSpider, you can do this for non-persistent filtering:
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

rules = (
    Rule(SgmlLinkExtractor(unique=True,
                           deny=[r'.*QuickInfo.*'],
                           allow_domains=allowed_domains,
                           restrict_xpaths=['//*[starts-with(@id, "e")]//a',
                                            '//*[starts-with(@id, "HP_Priority")]//a']),
         follow=True),
)
unique=True will filter duplicate requests for this spider instance.