Help! Reading the Scrapy source code is not easy for me.
I have a very long start_urls list: about 3,000,000 urls stored in a file. So I build start_urls like this:
import codecs

start_urls = read_urls_from_file(u"XXXX")

def read_urls_from_file(file_path):
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                url = line.strip()
                yield url
            except:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finish!"
Meanwhile, my spider's callback functions look like this:
def parse(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.baidu.com", callback=self.just_test1)

def just_test1(self, response):
    self.log("Visited %s" % response.url)
    return Request(url="http://www.163.com", callback=self.just_test2)

def just_test2(self, response):
    self.log("Visited %s" % response.url)
    return []
My questions are:

1. What is the order of the urls used by the downloader? Will the Requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems that the answer is No.)
2. What decides the order? Why is the order like this, and how can we control it?
3. Is this a good way to deal with so many urls that are already in a file? What else?

Thank you very much!
Thanks for the answers, but I am still a bit confused: by default, Scrapy uses a LIFO queue for storing pending requests. Requests made by the spiders' callback functions are given to the scheduler. Who does the same for the start_urls requests? The spider's start_requests() function only generates an iterator without issuing the real requests. Will all the requests (the start_urls' and the callbacks') end up in the same request queue? How many queues are there in Scrapy?

First of all, please see this thread - I think you'll find all the answers there.
What is the order of the urls used by the downloader? Will the Requests made by just_test1 and just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems that the answer is No.)
You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
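To watch the interleaving yourself, here is a minimal sketch (the spider name and urls are hypothetical, and it assumes a Scrapy version that exposes scrapy.Spider and scrapy.Request): requests yielded from a callback go to the same scheduler as the pending start_urls, so the downloader does not wait for the start_urls to be exhausted.

import scrapy

class OrderDemoSpider(scrapy.Spider):
    # Hypothetical spider, only useful for watching the scheduling order in the logs.
    name = "order_demo"
    start_urls = ["http://example.com/page/%d" % i for i in range(10)]

    def parse(self, response):
        self.log("start_url fetched: %s" % response.url)
        # This request enters the same scheduler queue as the remaining
        # start_urls, so it can be downloaded before they are all fetched.
        yield scrapy.Request("http://example.com/extra",
                             callback=self.parse_extra, dont_filter=True)

    def parse_extra(self, response):
        self.log("callback request fetched: %s" % response.url)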
What decides the order? Why is the order like this, and how can we control it?
By default, there is no pre-defined order - you cannot know when Requests from make_requests_from_url will arrive - it's asynchronous.
See this answer on how you may control the order.
Long story short, you can override start_requests and give the yielded Requests a priority (like yield Request(url, priority=0)); the scheduler executes requests with a higher priority value earlier. For example, the priority can be derived from the line number where the url was found.
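If what you actually want is plain first-in-first-out order rather than per-request priorities, another option (my own suggestion, not necessarily what the linked answer describes) is to switch the scheduler's queues from the default LIFO to FIFO in settings.py; note that the module path of the queue classes depends on your Scrapy version:

# settings.py - a sketch, assuming a Scrapy version that ships scrapy.squeues
# (older releases expose the same classes under a differently named module).
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'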
Is this a good way to deal with so many urls which are already in a file? What else?
I think you should read your file and yield the urls directly in the start_requests method: see this answer.
So, you should do something like this:
import codecs  # at the top of the spider module

def start_requests(self):
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            try:
                url = line.strip()
                # Higher-priority requests are executed first, so use the
                # negative line number to preserve the original file order.
                yield Request(url, priority=-index)
            except:
                continue
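As a side note for a crawl of 3,000,000 urls (my own addition, not part of the question): Scrapy's job persistence can be useful so the run can be paused and resumed later; the directory name below is only an example.

# settings.py (optional) - persist the scheduler queue to disk so the crawl
# can be stopped and resumed; any writable directory works.
JOBDIR = 'crawls/url_spider-run1'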
Hope that helps.