Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scrapy - Correct way to change User Agent in Request

I have created a custom Middleware in Scrapy by overriding the RetryMiddleware which changes both Proxy and User-Agent before retrying. It looks like this

class CustomRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            Proxy_UA_Middleware.switch_proxy()
            Proxy_UA_Middleware.switch_ua()
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})

The Proxy_UA_Middlware class is quite long. Basically it contains methods that change proxy and user agent. I have both these middlewares configured properly in my settings.py file. The proxy part works okay but the User Agent doesn't change. The code I've used to changed User Agent looks like this

request.headers.setdefault('User-Agent', self.user_agent)

where self.user_agent is a random value taken from an array of user agents. This doesn't work. However, if I do this

request.headers['User-Agent'] = self.user_agent

then it works just fine and the user agent changes successfully for each retry. But I haven't seen anyone use this method to change the User Agent. My question is if changing the User Agent this way is okay and if not what am I doing wrong?

like image 659
detweiller Avatar asked Oct 26 '25 05:10

detweiller


1 Answers

If you always want to control which user-agent to use on that middleware, then it is ok, what setdefault does is to check if there is no User-Agent assigned before, which is possible because other middlewares could be doing it, or even assigning it from the spider.

Also I think you should also disable the default UserAgentMiddleware or even set a higher priority to your middleware, check that UserAgentMiddleware priority is 400, so set yours to be before (some number before 400).

like image 104
eLRuLL Avatar answered Oct 28 '25 18:10

eLRuLL



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!