Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice

I have some code which looks something like this:

from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()  # blocks until the crawl has finished
    return True

I have two py.test tests that each call run(). When the second test executes, I get the following error:

    runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
    reactor.run(installSignalHandlers=False)  # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
    ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>

    def startRunning(self):
        """
            Method called when reactor starts: do some initialization and fire
            startup events.

            Don't call this directly, call reactor.run() instead: it should take
            care of calling this.

            This method is somewhat misnamed.  The reactor will not necessarily be
            in the running state by the time this method returns.  The only
            guarantee is that it will be on its way to the running state.
            """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable

I understand that the reactor has already been run once, so I cannot call runner.start() again when the second test runs. Is there some way to reset its state in between the tests, so that they are more isolated and can actually run one after another?

Joe Roe asked Oct 21 '25 03:10


1 Answer

According to the Scrapy docs:

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

For example:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

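If you want to run another spider after you've called process.start(), I expect you can just issue another process.crawl(SomeSpider) call at the point in your program where you determine the need to do this. Because process.start() blocks until the crawling jobs finish, that call has to come from code that runs while the reactor is still running (a callback, for example). Once start() has returned, the reactor has been stopped and cannot be started again in the same process, which is the error you are seeing.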
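To make the failure mode concrete, here is a toy model of the two flags that appear in the traceback in the question (_started and _startedBefore). It is only a sketch and not Twisted's implementation: the class OneShotReactor, its non-blocking run() and stop(), and the helper second_run_error() are made up for illustration, and the exception classes are local stand-ins for the ones in twisted.internet.error.

```python
class ReactorAlreadyRunning(Exception):
    """Local stand-in for twisted.internet.error.ReactorAlreadyRunning."""


class ReactorNotRestartable(Exception):
    """Local stand-in for twisted.internet.error.ReactorNotRestartable."""


class OneShotReactor:
    """Hypothetical toy model; it only mirrors the flag checks in the traceback."""

    def __init__(self):
        self._started = False        # True while the reactor is running
        self._startedBefore = False  # True forever after the first run

    def run(self):
        # Unlike the real reactor, this does not block; it only flips flags.
        if self._started:
            raise ReactorAlreadyRunning()
        if self._startedBefore:
            raise ReactorNotRestartable()
        self._started = True
        self._startedBefore = True

    def stop(self):
        # Stopping clears _started but never clears _startedBefore.
        self._started = False


def second_run_error():
    """Run, stop, then run again; return the name of the error raised."""
    reactor = OneShotReactor()
    reactor.run()
    reactor.stop()
    try:
        reactor.run()
    except ReactorNotRestartable as exc:
        return type(exc).__name__
    return None
```

In this model nothing ever resets _startedBefore, which matches the behaviour described in the question: the second start fails even though the first run has completely finished.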

Examples of other scenarios are given in the docs.
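For the test-suite case in the question, one workaround that is often used is to run each crawl in its own child process, so that every call gets a reactor that has never been started. The following is a minimal sketch under some assumptions: a POSIX platform where the "fork" start method is available, and a parent process that never starts the reactor itself. The names run_isolated, _child and crawl are hypothetical helpers, not part of Scrapy.

```python
import multiprocessing


def _child(conn, func, args):
    """Run func(*args) in the child and send ("ok", result) or ("err", repr)."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("err", repr(exc)))
    finally:
        conn.close()


def run_isolated(func, *args):
    """Call func(*args) in a forked child process and return its result."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads from the pipe
    status, payload = parent_conn.recv()
    proc.join()
    if status == "err":
        raise RuntimeError(payload)
    return payload


def crawl(spider_name, settings):
    """Same body as run() in the question; executed inside the child."""
    # Imported lazily so the parent process never has to touch the reactor.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings)
    process.crawl(spider_name)
    process.start()  # blocks until the crawl has finished
    return True


def run(spider_name, settings):
    # Each call forks a new process, so the reactor state is never reused.
    return run_isolated(crawl, spider_name, settings)
```

With this shape, both tests can call run() one after the other. The trade-off is that the spider runs outside the test process, so anything the test wants to inspect has to be returned through the pipe (and therefore be picklable) or written somewhere the test can read.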

Jean-Paul Calderone answered Oct 23 '25 16:10