 

How to crawl only a limited number of pages from each site using Scrapy?

Tags: python, scrapy

I need to crawl a number of sites, and I only want to crawl a certain number of pages from each site. How can I implement this?

My thought is to use a dict where the key is the domain name and the value is the number of pages from that domain already stored in MongoDB. So when a page is crawled and stored in the database successfully, the count for its domain increases by one. If the count exceeds the maximum, the spider should stop crawling that site.
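In isolation, the counting idea I have in mind is roughly this (just a plain-Python sketch with illustrative names, separate from the real spider and pipeline below):

MAX_PAGES_PER_SITE = 100           # the per-site limit I want to enforce

crawled_pages_per_site = {}        # domain -> number of pages stored so far

def record_stored_page(domain):
    # Count one more stored page for `domain` and report whether its limit is reached.
    crawled_pages_per_site[domain] = crawled_pages_per_site.get(domain, 0) + 1
    return crawled_pages_per_site[domain] >= MAX_PAGES_PER_SITE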

Below is my code, but it doesn't work: even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite, the spider keeps crawling that site.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# AnExampleItem and parse_page are defined elsewhere in my project

class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        # url_file contains one domain per line
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw

        self.maximumPagesPerSite = 100  # maximum pages per site
        self.crawledPagesPerSite = {}   # domain -> pages stored so far

    def parse_url(self, response):
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = response.url
        item["extracted_text"] = extracted_text
        return item

import pymongo
import tldextract
from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]

        # drop items that have an empty field
        for field, value in item.items():
            if not value:
                raise DropItem("Missing {0}!".format(field))

        self.collection.insert(dict(item))
        log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)

        # count one more stored page for this domain
        if domain_name in spider.crawledPagesPerSite:
            spider.crawledPagesPerSite[domain_name] += 1
        else:
            spider.crawledPagesPerSite[domain_name] = 1

        # once the limit is reached, try to stop this site by removing its domain
        # from the allowed domains and from the rule's link extractor
        if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
            suffix = tldextract.extract(item['url']).suffix
            domain_and_suffix = domain_name + "." + suffix
            if domain_and_suffix in spider.allowed_domains:
                spider.allowed_domains.remove(domain_and_suffix)
                spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                return None
        return item
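The pipeline is enabled in my project settings roughly like this (the module path and values here are placeholders, not my real ones):

ITEM_PIPELINES = {
    "myproject.pipelines.MongoDBPipeline": 300,   # assumed module path
}
MONGODB_SERVER = "localhost"      # placeholder values for the keys the pipeline reads
MONGODB_PORT = 27017
MONGODB_COLLECTION = "pages"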
asked Oct 25 '25 by ningyuwhut

1 Answer

I'm not sure if this is what you're looking for, but I use this approach when I only want to scrape a fixed number of pages. Let's say I want to scrape only the first 100 pages (page-0 through page-99) of example.com; I'd go about it the following way:

start_urls = ["https://example.com/page-%s.htm" % page for page in list(range(100))]

The spider stops once it has worked through page 99. This only works when the URLs have page numbers in them, though.
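Put into a complete (if minimal) spider, it would look something like this; the URL pattern and the yielded fields are only assumptions for illustration:

import scrapy

class LimitedPagesSpider(scrapy.Spider):
    name = "limited_pages"
    # Only these 100 URLs (page-0 ... page-99) are ever scheduled, so the crawl
    # stops by construction once they are exhausted.
    start_urls = ["https://example.com/page-%s.htm" % page for page in range(100)]

    def parse(self, response):
        # Grab whatever you need from each page; url and title are just examples.
        yield {"url": response.url, "title": response.css("title::text").get()}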

answered Oct 26 '25 by Hamza Rana


