I'm a beginner in Python and Scrapy. I've just created a Scrapy project with multiple spiders, but when running "scrapy crawl ..." it runs only the first spider.
How can I run all spiders in the same process?
Thanks in advance.
Every spider in your project has its own name attribute, e.g. name = "yourspidername". When you call scrapy crawl yourspidername, it crawls only that spider; you have to run scrapy crawl yourotherspidername again to crawl the other one.
The other way is to list all the spiders in the same command, like scrapy crawl yourspidername,yourotherspidername, etc. (this method is not supported in newer versions of Scrapy).
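For instance, a project might declare its spiders roughly like the sketch below; the class names, spider names and URLs here are just placeholders, not anything from your project:

import scrapy

class YourSpider(scrapy.Spider):
    name = "yourspidername"   # what "scrapy crawl yourspidername" looks up
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

class YourOtherSpider(scrapy.Spider):
    name = "yourotherspidername"   # needs its own "scrapy crawl" invocation
    start_urls = ["https://example.org"]

    def parse(self, response):
        yield {"url": response.url}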
Everyone, even the docs, suggests using the internal API to write a "run script" that controls the start and stop of multiple spiders. However, this comes with a lot of caveats unless you get it absolutely right (feed exports not working, the Twisted reactor either not stopping or stopping too soon, etc.).
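For reference, such a run script is typically a small file along the lines of the sketch below, using Scrapy's CrawlerProcess (the file name run_all.py is just an assumption), and the caveats above still apply:

# run_all.py -- minimal sketch of the internal-API "run script" approach
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# schedule every spider registered in the project, then start the reactor once
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)

process.start()   # blocks until all scheduled crawls have finished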
In my opinion, we already have a known, working and supported scrapy crawl x command, so a much easier way to handle this is to use GNU Parallel to parallelize.
After installing it, to run one Scrapy spider per core (from the shell), assuming you want to run all the spiders in your project:
scrapy list | parallel --line-buffer scrapy crawl
If you only have one core, you can play around with the --jobs argument to GNU Parallel. For example, the following will run 2 Scrapy jobs per core:
scrapy list | parallel --jobs 200% --line-buffer scrapy crawl