urllib.urlretrieve returns silently even if the file doesn't exist on the remote http server, it just saves a html page to the named file. For example:
urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg') just returns silently, even if abc.jpg doesn't exist on google.com server, the generated abc.jpg is not a valid jpg file, it's actually a html page . I guess the returned headers (a httplib.HTTPMessage instance) can be used to actually tell whether the retrieval successes or not, but I can't find any doc for httplib.HTTPMessage.
Can anybody provide some information about this problem?
The problem here is that urlopen returns a reference to a file object from which you should retrieve HTML. Please note that urllib. urlopen function is marked as deprecated since python 2.6. It's recommended to use urllib2.
The urlretrieve() function provided by the urllib module. The urlretrieve () method downloads the remote data directly to the local. The parameter filename specifies the save local path (if the parameter is not specified, urllib will generate a temporary file to save the data.)
Consider using urllib2 if it possible in your case.  It is more advanced and easy to use than urllib.
You can detect any HTTP errors easily:
>>> import urllib2 >>> resp = urllib2.urlopen("http://google.com/abc.jpg") Traceback (most recent call last): <<MANY LINES SKIPPED>> urllib2.HTTPError: HTTP Error 404: Not Found resp is actually HTTPResponse object that you can do a lot of useful things with:
>>> resp = urllib2.urlopen("http://google.com/") >>> resp.code 200 >>> resp.headers["content-type"] 'text/html; charset=windows-1251' >>> resp.read() "<<ACTUAL HTML>>" If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With