 

How to crawl multiple websites at different intervals in Scrapy

I have multiple websites stored in a database, each with its own crawl interval (for example, every 5 or 10 minutes). I have created a spider and run it with cron; it takes all the websites from the database and crawls them in parallel. How can I crawl each website at the interval stored in the database? Is there any way to handle this in Scrapy?

asked Dec 06 '25 by bhattraideb

1 Answer

Have you tried playing around with adding a scheduling component in start_requests?

def start_requests(self):
    while True:
        for spid_url in url_db['to_crawl'].find(typ='due'):
            # update the site's record with its next crawl time
            yield scrapy.Request(...)

        # sleep until the next URL is due,
        # then mark that URL's record as due again
        if enough:
            break
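
To flesh that idea out, here is a minimal sketch of such a scheduling loop against a concrete store. Everything specific in it is an assumption for illustration: the sqlite3 database, the sites table and its columns (url, interval_minutes, next_due), and the max_runtime/poll settings are not part of the original answer.

import sqlite3
import time
from datetime import datetime, timedelta

import scrapy


class ScheduledSpider(scrapy.Spider):
    name = 'scheduled'

    # hypothetical schema: sites(url TEXT, interval_minutes INTEGER, next_due TEXT)
    db_path = 'sites.db'
    max_runtime = 3600  # stop the loop after an hour; cron restarts the spider

    def start_requests(self):
        conn = sqlite3.connect(self.db_path)
        started = time.time()

        while time.time() - started < self.max_runtime:
            now = datetime.utcnow().isoformat()
            # fetch every site whose next_due timestamp has passed
            rows = conn.execute(
                "SELECT url, interval_minutes FROM sites WHERE next_due <= ?",
                (now,),
            ).fetchall()

            for url, interval in rows:
                # push the site's next_due forward by its own interval
                next_due = (datetime.utcnow() + timedelta(minutes=interval)).isoformat()
                conn.execute(
                    "UPDATE sites SET next_due = ? WHERE url = ?", (next_due, url)
                )
                conn.commit()
                # dont_filter lets the same URL be crawled again on later passes
                yield scrapy.Request(url, callback=self.parse, dont_filter=True)

            # wait before polling the table again
            time.sleep(30)

        conn.close()

    def parse(self, response):
        self.logger.info('Crawled %s', response.url)

One caveat with this approach: time.sleep blocks Scrapy's single-threaded Twisted reactor, so nothing else runs during the wait. For long gaps it is usually cleaner to let cron (or the spider_idle signal) kick off a fresh pass over the due sites instead of sleeping inside start_requests.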
answered Dec 08 '25 by Thomas Strub

