I'm running a self-contained Scrapy spider that lives in a single .py file. In case of a server failure, power outage, or any other reason the script might fail, is there an elegant way to make sure I can resume the run after recovery?
Maybe something similar to the built-in JOBDIR setting?
You can still use the JOBDIR option if you have a self-contained script; for example, you can set a value in the custom_settings attribute:
class MySpider(scrapy.Spider):
    custom_settings = {
        'JOBDIR': './job',
    }
    # ...
Alternatively, you can set this option when creating the CrawlerProcess (if that's what you're using to run the spider from a script):
process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()
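For reference, a complete single-file version might look like the sketch below (the file name, start URL, and ./job path are placeholders, not part of the original answer). Running the script once creates the job directory; if the process dies, running the same command again resumes from the persisted request queue.
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        # Yield items/requests as usual; pending requests are serialized
        # to JOBDIR so they survive a crash or shutdown.
        yield {'url': response.url}


if __name__ == '__main__':
    process = CrawlerProcess({'JOBDIR': './job'})
    process.crawl(MySpider)
    process.start()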
There's a whole documentation page covering this issue (Jobs: pausing and resuming crawls):
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
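That documentation page also describes keeping persistent spider state between batches via the spider.state dict, which Scrapy serializes to JOBDIR on shutdown and restores on resume. A minimal hedged sketch (the items_count key is just an illustrative name):
def parse_item(self, response):
    # self.state is only persisted when JOBDIR is set; it must hold
    # picklable values. Simple counters or checkpoints survive a restart.
    self.state['items_count'] = self.state.get('items_count', 0) + 1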