I'm running a self-contained Scrapy spider that lives in a single .py file. In case of a server failure, power outage, or any other reason the script might fail, is there an elegant way to make sure I can resume the run after recovery?
Maybe something similar to the built-in JOBDIR setting?
You can still use the JOBDIR option if you have a self-contained script; for example, you can set a value in the custom_settings attribute:
class MySpider(scrapy.Spider):
    custom_settings = {
        'JOBDIR': './job',
    }
    # ...
Alternatively, you can set this option when creating the CrawlerProcess (if that's what you're using to run the spider from a script):
process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()
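For reference, a complete single-file version might look like the sketch below (the file name, start URL, and ./job path are placeholders, not part of the original answer). Running the script once creates the job directory; if the process dies, running the same command again resumes from the persisted request queue.
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        # Yield items/requests as usual; pending requests are serialized
        # to JOBDIR so they survive a crash or shutdown.
        yield {'url': response.url}


if __name__ == '__main__':
    process = CrawlerProcess({'JOBDIR': './job'})
    process.crawl(MySpider)
    process.start()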
There's a whole documentation page covering this issue (Jobs: pausing and resuming crawls):
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
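That documentation page also describes keeping persistent spider state between batches via the spider.state dict, which Scrapy serializes to JOBDIR on shutdown and restores on resume. A minimal hedged sketch (the items_count key is just an illustrative name):
def parse_item(self, response):
    # self.state is only persisted when JOBDIR is set; it must hold
    # picklable values. Simple counters or checkpoints survive a restart.
    self.state['items_count'] = self.state.get('items_count', 0) + 1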