I have multiple websites stored in a database, each with its own crawl interval (e.g., every 5 or 10 minutes per site). I have created a spider and run it with cron: it takes all the websites from the database and crawls them in parallel. How can I crawl each website on the schedule stored in the database? Is there any way to handle this in Scrapy?
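For illustration, assume each database record carries the site's URL, its crawl interval, and the time it is next due; the field names below are placeholders, not the actual schema:

from datetime import datetime

# one stored record (hypothetical layout, field names are assumptions)
site = {
    'url': 'https://example.com',
    'interval_minutes': 5,               # crawl this site every 5 minutes
    'next_crawl_at': datetime.utcnow(),  # when the site is next due
}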
Have you tried adding a scheduling component in start_requests? A rough sketch of the idea, assuming the sites live in a pymongo collection with url, interval_minutes, and next_crawl_at fields (all illustrative names):
import time
from datetime import datetime, timedelta
import scrapy

def start_requests(self):
    while True:
        # fetch every site whose next crawl time has passed
        # (url_db: assumed pymongo Database handle; field names are illustrative)
        for site in url_db['to_crawl'].find({'next_crawl_at': {'$lte': datetime.utcnow()}}):
            # push this site's next due time forward by its own interval
            url_db['to_crawl'].update_one(
                {'_id': site['_id']},
                {'$set': {'next_crawl_at': datetime.utcnow()
                          + timedelta(minutes=site['interval_minutes'])}})
            yield scrapy.Request(site['url'], callback=self.parse)
        if self.crawl_budget_reached():  # stand-in for the original "if enough" check
            break
        time.sleep(5)  # wait before polling for newly due sites
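One caveat with this approach: Scrapy pulls items from start_requests lazily on Twisted's reactor thread, so a time.sleep there stalls the whole crawler, including in-flight requests, for its duration; keeping the sleep short limits the impact. Instead of a fixed poll interval, you could also sleep just until the soonest due site, roughly like this (a sketch; assumes the collection is non-empty and uses the same illustrative field names):

nxt = url_db['to_crawl'].find_one(sort=[('next_crawl_at', 1)])
time.sleep(max(0.0, (nxt['next_crawl_at'] - datetime.utcnow()).total_seconds()))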