I need to crawl a number of sites, and I only want to crawl a certain number of pages from each site. How can I implement this?
My idea is to use a dict where the key is the domain name and the value is the number of pages that have already been stored in MongoDB. When a page is crawled and stored in the database successfully, the counter for that domain increases by one. If the counter exceeds the maximum, the spider should stop crawling that site.
Below is my code, but it doesn't work: even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite, the spider keeps crawling that site.
class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw
        self.maximumPagesPerSite = 100  # maximum pages per site
        self.crawledPagesPerSite = {}

    def parse_url(self, response):
        url = response.url
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = url
        item["extracted_text"] = extracted_text
        return item
class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)
            if domain_name in spider.crawledPagesPerSite:
                spider.crawledPagesPerSite[domain_name] += 1
            else:
                spider.crawledPagesPerSite[domain_name] = 1
            if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
                suffix = tldextract.extract(item['url']).suffix
                domain_and_suffix = domain_name + "." + suffix
                if domain_and_suffix in spider.allowed_domains:
                    spider.allowed_domains.remove(domain_and_suffix)
                    spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                    return None
        return item
I am not sure if this is what you're looking for, but I use this approach when I only want to scrape a certain number of pages. Let's say I want to scrape only the first 99 pages from example.com; I'd go about it the following way:
start_urls = ["https://example.com/page-%s.htm" % page for page in range(1, 100)]
The spider will stop after it reaches page 99. But this only works when your URLs have page numbers in them.
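If the sites listed in your url_file follow a similar page-numbering scheme, the same idea can be extended to several domains at once. This is only a sketch under that assumption; the /page-N.htm pattern and the cap of 100 pages per site are made up for illustration:

# Sketch: assumes every domain serves pages at http://<domain>/page-<N>.htm
MAX_PAGES_PER_SITE = 100

# url_file is the same file of domains your spider already reads
with open(url_file) as f:
    domains = [line.strip() for line in f if line.strip()]

# Schedule at most MAX_PAGES_PER_SITE start URLs per domain
start_urls = [
    "http://%s/page-%s.htm" % (domain, page)
    for domain in domains
    for page in range(1, MAX_PAGES_PER_SITE + 1)
]

Note that this only caps how many URLs get scheduled up front; if the spider also follows links (follow=True), you would still need a per-domain counter like your crawledPagesPerSite to limit the total.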