
Persistent Duplicate Filtering in Scrapy

I have just started using Scrapy and I would like a way to persist the URLs that have previously been crawled, so that subsequent scrapes only take in data from unknown URLs. I am seeing a couple of different ways to filter duplicates and a couple of ways to persist data, and I would like to know the recommended way of doing both in Scrapy 0.24. Here are the options as I see them:

For duplicate filtering

There is the DUPEFILTER_CLASS setting in settings.py, which is still referenced in the documentation. I have also seen documentation that suggests putting a duplicates filter in an item pipeline, as seen here: http://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=duplicates#duplicates-filter

Are people using DUPEFILTER_CLASS or putting a dupefilter in an Item Pipeline?
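
For reference, the two approaches look roughly like this (a sketch only; the "id" field in the pipeline is the illustrative field from the linked docs example, not something in my project):

    # settings.py -- point Scrapy at a dupe filter class (RFPDupeFilter is the 0.24 default)
    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'

    # pipelines.py -- the docs-style duplicates pipeline, keyed on an item field
    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):

        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            self.ids_seen.add(item['id'])
            return item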

For persistent duplicate tracking

I have tried to use scrapy-redis to persist the URLs that have previously been scraped so they can be used by a duplicate filter, but it seems that whatever DUPEFILTER_CLASS I specify is ignored. I also see that there is spider.state, which will store a dictionary if you use the JOBDIR option at runtime, but it doesn't seem well suited to duplicate filtering.
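
For what it's worth, the JOBDIR / spider.state approach I tried looks roughly like this (a sketch; the spider name and the crawls directory are placeholders):

    # run with:  scrapy crawl myspider -s JOBDIR=crawls/myspider-1
    # self.state is a dict that Scrapy pickles into JOBDIR between runs
    def parse(self, response):
        seen = self.state.setdefault('seen_urls', set())
        if response.url in seen:
            return
        seen.add(response.url)
        # ... extract items / follow links from new pages ...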

Can anyone point me to a code sample that can be used to persist data between batches and do duplicate filtering?

asked by ajt

1 Answer

I don't have tested code nor enough rep to comment on your question, but here is a suggestion for the persistent URL dupe-filtering idea.

Keep a database of crawled URLs.

Implement a downloader middleware that does the following (Python-ish pseudocode):

    if url isn't present:
        add url to database
        return None    # this tells Scrapy to keep handling the request as normal
    else:
        raise IgnoreRequest
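
Here is a minimal sketch of that middleware, assuming a plain text file as the "database" of seen URLs (the file name and the SEEN_URLS_FILE setting are made up for this example; swap in a real database as needed):

    import os

    from scrapy import signals
    from scrapy.exceptions import IgnoreRequest

    class PersistentDupeMiddleware(object):

        def __init__(self, path):
            self.path = path
            self.seen = set()
            # load previously seen URLs from disk, if any
            if os.path.exists(path):
                with open(path) as f:
                    self.seen = set(line.strip() for line in f)

        @classmethod
        def from_crawler(cls, crawler):
            mw = cls(crawler.settings.get('SEEN_URLS_FILE', 'seen_urls.txt'))
            crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
            return mw

        def process_request(self, request, spider):
            if request.url in self.seen:
                raise IgnoreRequest('already crawled: %s' % request.url)
            self.seen.add(request.url)
            return None  # let Scrapy keep handling the request as normal

        def spider_closed(self, spider):
            # write the seen set back out so the next run starts from it
            with open(self.path, 'w') as f:
                for url in self.seen:
                    f.write(url + '\n')

Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.PersistentDupeMiddleware': 543} (the module path and priority number are placeholders).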

HTH

Edit: downloader middleware docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html

If you are using a CrawlSpider, you can do this for non-persistent filtering:

    rules = (
        Rule(SgmlLinkExtractor(unique=True,
                               deny=[r'.*QuickInfo.*'],
                               allow_domains=allowed_domains,
                               restrict_xpaths=['//*[starts-with(@id, "e")]//a',
                                                '//*[starts-with(@id, "HP_Priority")]//a']),
             follow=True),
    )

unique=True will filter duplicate requests within a single run of this spider; it does not persist between runs.
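
For completeness, here is a minimal sketch of where those rules fit in a CrawlSpider (the spider name, domains and callback are illustrative placeholders):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ExampleSpider(CrawlSpider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        rules = (
            Rule(SgmlLinkExtractor(unique=True, allow_domains=allowed_domains),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # extract items from each followed page here
            pass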

answered by rocktheartsm4l


