 

Recursive spider with scrapy only scraping first page

Tags: python, scrapy

I'm attempting to write a spider that recursively scrapes an entire site, using scrapy.

However, while it scrapes the first page fine, it then finds the links on that page but doesn't follow them and scrape those pages, which is what I need.

I've created a scrapy project and started writing a spider that looks like this:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urlparse import urljoin

class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["www.foo.bar/"]
    start_urls = (
        'http://www.foo.bar/events/',
        )
    rules = (
        Rule(LinkExtractor(), callback="parse", follow= True),
        )

    def parse(self, response):
        yield {
        'url':response.url,
        'language':response.xpath('//meta[@name=\'Language\']/@content').extract(),
        'description':response.xpath('//meta[@name=\'Description\']/@content').extract(),
        }
        for url in response.xpath('//a/@href').extract():
            if url and not url.startswith('#'):
                self.logger.debug(urljoin(response.url, url))
                scrapy.http.Request(urljoin(response.url, url))

Then, when running the spider with scrapy crawl events -o events.json, I get the following console output:

PS C:\Projects\foo\src\Scrapy> scrapy crawl events -o .\events.json
2016-05-16 09:54:36 [scrapy] INFO: Scrapy 1.1.0 started (bot: foo)
2016-05-16 09:54:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'foo.spiders', 'FEED_URI': '.\\events.json
', 'SPIDER_MODULES': ['foo.spiders'], 'BOT_NAME': 'foo', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-05-16 09:54:36 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-05-16 09:54:36 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-05-16 09:54:36 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-05-16 09:54:36 [scrapy] INFO: Enabled item pipelines:
[]
2016-05-16 09:54:36 [scrapy] INFO: Spider opened
2016-05-16 09:54:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-16 09:54:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-16 09:54:36 [scrapy] DEBUG: Crawled (200) <GET http://www.foo.co.uk/robots.txt> (referer: None)
2016-05-16 09:54:37 [scrapy] DEBUG: Crawled (200) <GET http://www.foo.co.uk/events/> (referer: None)
2016-05-16 09:54:37 [scrapy] DEBUG: Scraped from <200 http://www.foo.co.uk/events/>
{'description': [], 'language': [u'en_UK'], 'url': 'http://www.foo.co.uk/events/'}
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/default.aspx
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/page/a-z/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/thing/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/other-thing/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/foo-about-us/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/contactus
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/bar
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/event
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/super-cool-party
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/another-event
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/more-events
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/tps-report-convention
...

more links

...

2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/tps-report-convention-two-the-return
2016-05-16 09:54:37 [scrapy] INFO: Closing spider (finished)
2016-05-16 09:54:37 [scrapy] INFO: Stored json feed (1 items) in: .\events.json
2016-05-16 09:54:37 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 524,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 6187,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 16, 8, 54, 37, 271000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 80,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 5, 16, 8, 54, 36, 913000)}
2016-05-16 09:54:37 [scrapy] INFO: Spider closed (finished)

In the events.json file produced by the crawl, the only page that appears to have been scraped is the start URL specified at the top of the script, when really I need every page matching /events/ to be scraped.

I'm not sure how to proceed on this, so any help on the matter would be greatly appreciated.

Thanks.

asked by Jordan Robinson

1 Answer

You should create an Item object, and actually use the CrawlSpider that you imported. Several things are going wrong in your spider: the rules attribute is only honoured by CrawlSpider, so subclassing plain scrapy.Spider means your rule is silently ignored; a CrawlSpider uses the parse method internally, so the rule's callback must have a different name; allowed_domains should list bare domains ("foo.bar", not "www.foo.bar/"), otherwise the OffsiteMiddleware filters out every followed link; and the Request objects you build in the loop are never yielded, so Scrapy never schedules them.
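First, define an item to hold the scraped fields. A minimal items.py sketch (YourItem and its field names are just illustrative, adapt them to your project):

# your_project/items.py
import scrapy

class YourItem(scrapy.Item):
    # one Field per value the spider collects
    url = scrapy.Field()
    language = scrapy.Field()
    description = scrapy.Field()

With the item in place, I made a few changes to your code; try to use it: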

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from your_project.items import YourItem  # adjust the module path to your project

class EventsSpider(CrawlSpider):
    name = "events"
    # bare domain only: no scheme, no path, no trailing slash
    allowed_domains = ["foo.bar"]
    start_urls = [
        'http://www.foo.bar/events/',
    ]
    # CrawlSpider follows every link the extractor finds and passes each
    # response to parse_item; the callback must not be named "parse",
    # which CrawlSpider reserves for its own machinery
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = YourItem()
        item['url'] = response.url
        item['language'] = response.xpath("//meta[@name='Language']/@content").extract()
        item['description'] = response.xpath("//meta[@name='Description']/@content").extract()
        yield item
        # no manual link loop needed: the Rule above already follows every
        # link, which is what the unscheduled Request objects were meant to do
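If you only want the pages under /events/ scraped while still crawling the rest of the site for links, you can split the rule in two; a sketch, assuming every event page's URL contains /events/ (CrawlSpider applies the first rule that matches each link, so the specific rule must come first):

rules = (
    # URLs containing /events/ are scraped and followed
    Rule(LinkExtractor(allow=r'/events/'), callback='parse_item', follow=True),
    # everything else is only followed for more links, not scraped
    Rule(LinkExtractor(), follow=True),
)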
answered by Daniil Mashkin

