
Avoid getting banned on sites using scrapy

I'm trying to download data from gsmarena. Sample code to download the HTC One ME spec from "http://www.gsmarena.com/htc_one_me-7275.php" is given below.

The data on the website is organized into tables and table rows, in the following format:

table header > td[@class='ttl'] > td[@class='nfo']
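To illustrate, here is roughly how those ttl/nfo pairs get extracted (the HTML fragment below is a simplified stand-in for the actual gsmarena markup, not copied from the site):

from scrapy.selector import Selector

# simplified stand-in for one spec table on the page
html = """
<table>
  <tr><th>Display</th></tr>
  <tr><td class="ttl">Size</td><td class="nfo">5.2 inches</td></tr>
</table>
"""

sel = Selector(text=html)
for ttl in sel.xpath("//td[@class='ttl']"):
    label = " ".join(ttl.xpath(".//text()").extract())
    value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
    print(label + ": " + value)  # prints "Size: 5.2 inches"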

items.py file:

import scrapy

class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()
    phoneDetails = scrapy.Field()

Spider file:

from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        phone = gsmArenaDataItem()
        details = []
        for table in response.css("div#specs-list table"):
            # the <th> of each table holds the section name, e.g. "Display"
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
        phone['phoneDetails'] = ", ".join(details)
        yield phone

However, I'm getting banned as soon as I try to even load the page in scrapy shell:

scrapy shell "http://www.gsmarena.com/htc_one_me-7275.php"

I've even tried using DOWNLOAD_DELAY = 3 in settings.py.
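For reference, that line sits at the top level of the project's settings.py:

# settings.py
DOWNLOAD_DELAY = 3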

Kindly suggest how I should go about this.

1 Answer

That's probably happening because of Scrapy's default user agent. In Scrapy's default settings, the BOT_NAME variable is used to compose the USER_AGENT, and my guess is that the site you want to crawl is blocking it. I tried looking over their robots.txt file but got no clue from there.

You can try setting a custom user agent. In your settings.py, add the following line:

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0"

Actually, your USER_AGENT can be almost any string that identifies a real browser.
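If a single fixed agent still gets blocked after a while, a common next step is to rotate the agent per request with a small downloader middleware. Below is a minimal sketch: RandomUserAgentMiddleware is a name I made up, the module path gsmarena_data.middlewares is assumed from your project name, and the agent strings are just examples.

# middlewares.py
import random

USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36",
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # overwrite the User-Agent header before each request is downloaded
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

Then enable it in settings.py, replacing the built-in user-agent middleware:

DOWNLOADER_MIDDLEWARES = {
    'gsmarena_data.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

Combined with your DOWNLOAD_DELAY, this makes the crawler look a lot less like a bot.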
