Pausing and resuming a self-contained Scrapy script

I'm running a self-contained Scrapy spider that lives in a single .py file. In case of a server failure, power outage, or any other reason the script might stop, is there an elegant way to make sure I can resume the run after recovery?

Maybe something similar to the built-in JOBDIR setting?

asked Dec 08 '25 by m.livs

2 Answers

You can still use the JOBDIR option in a self-contained script, e.g. by setting a value in the custom_settings attribute:

import scrapy

class MySpider(scrapy.Spider):
    # JOBDIR persists the scheduler queue and request-dedupe state on disk,
    # so an interrupted run can be resumed from the same directory.
    custom_settings = {
        'JOBDIR': './job',
    }
    # ...

Alternatively, you can set this option when creating a CrawlerProcess (if that's what you're using to run the spider from a script):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes or is interrupted
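
Put together, a minimal self-contained sketch might look like the following; the spider name, start URL, and parse callback are placeholders for illustration, not part of the original answer:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'                     # placeholder name
    start_urls = ['https://example.com']  # placeholder start URL
    custom_settings = {
        'JOBDIR': './job',  # crawl state is saved here between runs
    }

    def parse(self, response):
        # placeholder callback: extract the page title
        yield {'title': response.css('title::text').get()}

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()

Re-running the file with the same ./job directory present should resume from the persisted request queue instead of starting the crawl over.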
answered Dec 10 '25 by Mikhail Korobov

There's a whole documentation page covering this issue:

To start a spider with persistence support enabled, run it like this:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
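
For the self-contained script in the question, the same -s JOBDIR idea can be mirrored programmatically; a minimal sketch, assuming the directory is taken from the command line and that MySpider is defined in the same file:

import sys
from scrapy.crawler import CrawlerProcess

# Assumed invocation: `python myspider.py crawls/somespider-1`
# Reuse the same directory on the next run to resume the crawl.
jobdir = sys.argv[1] if len(sys.argv) > 1 else 'crawls/somespider-1'

process = CrawlerProcess({'JOBDIR': jobdir})
process.crawl(MySpider)
process.start()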

answered Dec 10 '25 by Granitosaurus

