Here is my code. My parse_item method is not getting called.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class SjsuSpider(CrawlSpider):
    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']
    # allow=() is used to match all links
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        print "some message"
        open("sjsupages", 'a').write(response.body)
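Setting allowed_domains to ['cs.sjsu.edu'] keeps the crawl on the CS site. Note that Scrapy's offsite filter also accepts subdomains of an allowed domain, so 'sjsu.edu' does not block cs.sjsu.edu; it only makes the crawl broader.
The reason parse_item is never called is the rule order: when several rules match the same link, only the first one is used, and your first rule has no callback.
Your rules could be written as a single rule: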
rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')]
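
To see why a single rule behaves differently from the two rules in the question, here is a toy model of the behaviour the Scrapy documentation describes for CrawlSpider: if several rules match the same link, the first one in the list is used. This is not Scrapy's code. `ToyRule` and `first_matching_rule` are hypothetical names made up for illustration, and the URL is just an example under the question's start URL.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToyRule:
    """Hypothetical stand-in for a CrawlSpider Rule (not Scrapy's class)."""
    allow: str = ""                 # regex; the empty pattern matches every URL
    callback: Optional[str] = None  # name of the method to call, if any
    follow: bool = False


def first_matching_rule(url: str, rules: Sequence[ToyRule]) -> Optional[ToyRule]:
    """Return the first rule whose pattern matches the URL, or None."""
    for rule in rules:
        if re.search(rule.allow, url):
            return rule
    return None


# The rules from the question: the rule without a callback comes first.
original_rules = [
    ToyRule(follow=True),
    ToyRule(callback="parse_item"),
]

# The single merged rule suggested in the answer.
merged_rules = [ToyRule(follow=True, callback="parse_item")]

url = "http://cs.sjsu.edu/faculty/"  # example URL, assumed for illustration
original_hit = first_matching_rule(url, original_rules)
merged_hit = first_matching_rule(url, merged_rules)

print(original_hit.callback)  # None
print(merged_hit.callback)    # parse_item
```

In this model the second rule of the original list is never reached, because the first rule already matches every link. Merging follow=True and the callback into one rule means every extracted link is both followed and passed to parse_item.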