I am using Scrapy and it is great: building a crawler is fast. As the number of websites grows, I need to create new spiders, but these websites are all of the same type, so all the spiders use the same items, pipelines, and parsing process.
The contents of the project directory:
test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py

To reduce source code redundancy, mybasespider.py defines a base spider, MyBaseSpider, which holds 95% of the source code. All other spiders inherit from it; if a spider needs something special, it overrides some class methods. Normally a new spider takes only a few lines of source code.
All common settings go in settings.py, and each spider's special settings go in [spider name]_settings.py. For example, the special settings of spider1 in spider1_settings.py:
from settings import *

LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'
START_URLS = [
    'http://test1.com/',
]

The special settings of spider2 in spider2_settings.py:
from settings import *

LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'
START_URLS = [
    'http://test2.com/',
]

Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider.
All URLs in START_URLS are copied into MyBaseSpider.start_urls. Each spider has different contents, but the name START_URLS used in the base spider MyBaseSpider does not change.
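The layering described above (shared defaults plus a small per-spider override) can be sketched in plain Python. This is only an illustration under assumptions: `COMMON_SETTINGS`, `SPIDER_OVERRIDES`, and `settings_for` are hypothetical names, and the contents of `COMMON_SETTINGS` are invented since the question does not show settings.py. The per-spider values are copied from the two settings files above.

```python
# Hypothetical shared defaults, standing in for what settings.py would hold.
COMMON_SETTINGS = {
    "LOG_LEVEL": "INFO",
}

# Per-spider values, copied from spider1_settings.py and spider2_settings.py.
SPIDER_OVERRIDES = {
    "spider1": {
        "LOG_FILE": "spider1.log",
        "LOG_LEVEL": "INFO",
        "JOBDIR": "spider1-job",
        "START_URLS": ["http://test1.com/"],
    },
    "spider2": {
        "LOG_FILE": "spider2.log",
        "LOG_LEVEL": "DEBUG",
        "JOBDIR": "spider2-job",
        "START_URLS": ["http://test2.com/"],
    },
}


def settings_for(spider_name):
    """Return the common settings overlaid with one spider's overrides."""
    merged = dict(COMMON_SETTINGS)  # copy so the shared defaults stay untouched
    merged.update(SPIDER_OVERRIDES.get(spider_name, {}))
    return merged
```

One possible use, which I have not verified against scrapyd: assign the result to a spider's `custom_settings` class attribute (for example `custom_settings = settings_for("spider1")`), so the overrides travel with the spider class instead of a separate settings module. The base spider could then read the URLs with `self.settings.getlist("START_URLS")`.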
The contents of scrapy.cfg:

[settings]
default = test.settings
spider1 = spider1.settings
spider2 = spider2.settings

[deploy]
url = http://localhost:6800/
project = test

To run a spider, such as spider1:
export SCRAPY_PROJECT=spider1
scrapy crawl spider1
But this approach can't be used to run spiders in scrapyd. The scrapyd-deploy command always uses the 'default' project name in the [settings] section of scrapy.cfg to build an egg file and deploy it to scrapyd.
I have several questions:
Is this the right way to use multiple spiders in one project if I don't create a project per spider? Are there any better ways?
How can I separate a spider's special settings as above so that they also work in scrapyd and still reduce source code redundancy?
If all spiders use the same JOBDIR, is it safe to run all spiders concurrently? Could the persistent spider state be corrupted?
Any insights would be greatly appreciated.
We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.
I don't know if this answers your first question, but I use Scrapy with multiple spiders. In the past I used the command

scrapy crawl spider1

But when I had more than one spider, this command would run that spider or other modules as well, so I started using this command instead:

scrapy runspider <your full spider1 path with the spiderclass.py>

For example: "scrapy runspider home/Documents/scrapyproject/scrapyproject/spiders/spider1.py"

I hope it helps :)