I need to crawl a number of sites, and I only want to crawl a certain number of pages from each site. How can I implement this?
My idea is to use a dict where the key is the domain name and the value is the number of pages that have already been stored in MongoDB. When a page is crawled and stored in the database successfully, the counter for that domain increases by one. If the counter exceeds the maximum, the spider should stop crawling that site.
Below is my code, but it doesn't work: even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite, the spider keeps crawling that site.
class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw
        self.maximumPagesPerSite = 100  # maximum pages per site
        self.crawledPagesPerSite = {}

    def parse_url(self, response):
        url = response.url
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = url
        item["extracted_text"] = extracted_text
        return item
class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)
            if domain_name in spider.crawledPagesPerSite:
                spider.crawledPagesPerSite[domain_name] += 1
            else:
                spider.crawledPagesPerSite[domain_name] = 1
            if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
                suffix = tldextract.extract(item['url']).suffix
                domain_and_suffix = domain_name + "." + suffix
                if domain_and_suffix in spider.allowed_domains:
                    spider.allowed_domains.remove(domain_and_suffix)
                    spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                    return None
        return item
I am not sure if this is what you're looking for, but I use this approach when I only want to scrape a certain number of pages. Let's say I want to scrape only the first 99 pages from example.com; I'd go about it the following way:
start_urls = ["https://example.com/page-%s.htm" % page for page in range(1, 100)]
The spider will stop after it reaches page 99. But this only works when your URLs have page numbers in them.
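If the sites listed in your url_file follow a similar page-numbering scheme, the same idea can be extended to several domains at once. This is only a sketch under that assumption; the /page-N.htm pattern and the cap of 100 pages per site are made up for illustration:

# Sketch: assumes every domain serves pages at http://<domain>/page-<N>.htm
MAX_PAGES_PER_SITE = 100

# url_file is the same file of domains your spider already reads
with open(url_file) as f:
    domains = [line.strip() for line in f if line.strip()]

# Schedule at most MAX_PAGES_PER_SITE start URLs per domain
start_urls = [
    "http://%s/page-%s.htm" % (domain, page)
    for domain in domains
    for page in range(1, MAX_PAGES_PER_SITE + 1)
]

Note that this only caps how many URLs get scheduled up front; if the spider also follows links (follow=True), you would still need a per-domain counter like your crawledPagesPerSite to limit the total.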