
Scrapy - Set TCP Connect Timeout

I'm trying to scrape a website with Scrapy. The site is extremely slow at times and can take 15-20 seconds to respond to the first request in a browser. Sometimes, when I crawl it with Scrapy, I keep getting a TCP timeout error, even though the site opens just fine in my browser. Here's the message:

2017-09-05 17:34:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.hosane.com/result/specialList> (failed 16 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

I have even overridden the USER_AGENT setting for testing. I don't think the DOWNLOAD_TIMEOUT setting applies in this case, since it defaults to 180 seconds, and Scrapy gives the TCP timeout error in less than 20-30 seconds.

Any idea what is causing this issue? Is there a way to set TCP timeout in Scrapy?

Asym asked Oct 25 '25 07:10


1 Answer

A TCP connection timeout can happen before the Scrapy-specified DOWNLOAD_TIMEOUT because the initial TCP connect timeout is defined by the OS, usually in terms of TCP SYN packet retransmissions.

By default on my Linux box, I have 6 retransmissions:

cat /proc/sys/net/ipv4/tcp_syn_retries
6

which, in practice, for Scrapy too, means 0 + 1 + 2 + 4 + 8 + 16 + 32 (+64) = 127 seconds before receiving a twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. from Twisted. (That's the initial attempt, then exponential backoff between each retry, and finally no reply after the 6th retry.)

If I set /proc/sys/net/ipv4/tcp_syn_retries to 8 for example, I can verify that I receive this instead:

User timeout caused connection failure: Getting http://www.hosane.com/result/specialList took longer than 180.0 seconds.

That's because 0+1+2+4+8+16+32+64+128(+256) > 180.
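To make the arithmetic above easier to check, here is a rough Python sketch of the backoff schedule. It assumes a 1-second initial wait that doubles after every unanswered SYN, which matches the sums in this answer; it ignores any upper bound the kernel may place on the wait interval, and the function name `syn_send_times` is made up for illustration.

```python
def syn_send_times(syn_retries, initial_wait=1):
    """Return (send_times, give_up_time) in seconds for one connect attempt.

    Simplified model (assumption): the wait starts at `initial_wait` seconds
    and doubles after every unanswered SYN, with no upper cap.
    """
    send_times = []
    elapsed = 0
    wait = initial_wait
    for _ in range(syn_retries + 1):  # the initial SYN plus each retransmission
        send_times.append(elapsed)
        elapsed += wait  # time spent waiting for a SYN-ACK that never arrives
        wait *= 2        # exponential backoff
    return send_times, elapsed


if __name__ == "__main__":
    for retries in (6, 8):
        times, give_up = syn_send_times(retries)
        print(f"tcp_syn_retries={retries}: SYNs sent at {times}, gives up after {give_up}s")
```

Under this model, 6 retries give up after 127 seconds (before the 180-second DOWNLOAD_TIMEOUT), while 8 retries would only give up after 511 seconds, so Scrapy's own timeout fires first.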

10060: A connection attempt failed... seems to be a Windows socket error code. If you want the TCP connection timeout to be at least as long as DOWNLOAD_TIMEOUT, you'll need to change the TCP SYN retry count. (I don't know how to do that on your system, but Google is your friend.)
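If it helps, here is a small sketch for estimating how many SYN retries would be needed for the OS connect timeout to reach a given DOWNLOAD_TIMEOUT. It uses the same simplified doubling model as the sums above (1 + 2 + 4 + ..., no cap), and the helper names are hypothetical. Reading the current value only works on Linux, where /proc/sys/net/ipv4/tcp_syn_retries exists; on other systems the helper returns None.

```python
from pathlib import Path

# Linux-only location of the SYN retry count (the file shown with `cat` earlier).
SYN_RETRIES_PATH = Path("/proc/sys/net/ipv4/tcp_syn_retries")


def connect_timeout(syn_retries):
    """Seconds until connect gives up: 1 + 2 + ... + 2**syn_retries (assumed model)."""
    return 2 ** (syn_retries + 1) - 1


def min_syn_retries(download_timeout):
    """Smallest retry count whose connect timeout is at least download_timeout."""
    retries = 0
    while connect_timeout(retries) < download_timeout:
        retries += 1
    return retries


def current_syn_retries(path=SYN_RETRIES_PATH):
    """Return the configured retry count, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    needed = min_syn_retries(180)  # 180 is Scrapy's default DOWNLOAD_TIMEOUT
    print(f"retries needed for a 180s DOWNLOAD_TIMEOUT: {needed}")
    print(f"current tcp_syn_retries: {current_syn_retries()}")
```

With the default 180 seconds, this model says 7 retries (255 seconds) would already be enough for DOWNLOAD_TIMEOUT to fire first.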

paul trmbrth answered Oct 28 '25 03:10


