I am making a program that downloads a large file, and I have added a feature that determines what percentage has been downloaded and informs the user each time another 10% has finished, along with the time (i.e., print(str(percent) + " downloaded at " + str(time))). When I tested the program on smaller files, however, I noticed it was far less accurate. Here is a sample program I made:
import urllib.request

def printout(a, b, c):
    print(str(a) + ", " + str(b) + ", " + str(c))

urllib.request.urlretrieve("http://downloadcenter.mcafee.com/products/tools/foundstone/fport.zip", r"C:\Users\Username\Downloads\fport.zip", reporthook=printout)
This downloads Fport, a tool I was going to download anyway. Anyway, I got this output:
0, 8192, 57843
1, 8192, 57843
2, 8192, 57843
3, 8192, 57843
4, 8192, 57843
5, 8192, 57843
6, 8192, 57843
7, 8192, 57843
8, 8192, 57843
Which I thought was exactly what I wanted. I was about to put it in when I noticed a little error: 8192 doesn't go into 57843 eight times. I plugged it into a calculator and discovered that, in fact, it goes in approximately 7 times, which is a rather large difference, considering. This disconnect affects bigger files less, but it is still there. Is this some kind of metadata or header? If so, it's rather large, isn't it? Is there a way I can account for it (i.e., is it always about 16000 bytes)?
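To double-check the arithmetic from the output above:

```python
# Checking the numbers from the output: 8 reported blocks of
# 8192 bytes against a Content-Length of 57843 bytes.
print(57843 / 8192)  # about 7.06, so only ~7 full blocks fit
print(8 * 8192)      # 65536 bytes implied by the hook vs 57843 actual
```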
So, if you look at the code in Lib/urllib/request.py (CPython 3.x), it becomes clear why this is the case:
with tfp:
    result = filename, headers
    bs = 1024*8  # we read 8 KB at a time
    size = -1
    read = 0
    blocknum = 0
    if "content-length" in headers:
        size = int(headers["Content-Length"])
    if reporthook:
        reporthook(blocknum, bs, size)
    while True:
        block = fp.read(bs)  # here is where we do the read
        if not block:
            break
        read += len(block)
        tfp.write(block)
        blocknum += 1
        if reporthook:
            reporthook(blocknum, bs, size)
In the last line, the reporthook is told that bs bytes were read, not len(block), which would probably be more accurate. I'm not sure why this is the case, i.e. whether there's a good reason or it's a minor bug in the library. You could ask on the Python mailing lists and/or file a bug, of course.
Note: I think it's fairly common to read data in fixed-sized blocks, see for example fread. There, the return value may not be the same as the number of bytes requested to be read if an EOF (end of file) was encountered, which is similar in the Python read API.
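Given that, one common workaround (a sketch, not part of the standard library) is to clamp the blocknum * bs estimate to the reported total size in your own hook:

```python
def reporthook(blocknum, bs, size):
    # blocknum * bs can overshoot the real total on the last block,
    # so clamp it to size when size is known (size is -1 otherwise).
    read = blocknum * bs
    if size >= 0:
        read = min(read, size)
    print(read, size)

# With the numbers from the question: 8 blocks of 8192 bytes would
# suggest 65536 bytes, but the file is only 57843 bytes.
reporthook(8, 8192, 57843)  # prints "57843 57843"
```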
The documentation explains that reporthook is called once per "chunk", with a chunk number and total size.
urllib.request will not try to make chunk sizes exactly equal; it will try to make chunk sizes a nice power of 2 like 8192, because that's generally fastest and simplest.
So, what you want to do is use the actual bytes for calculating percentage, not the chunk numbers.
The urlretrieve interface doesn't give you an easy way to get the actual bytes. Counting blocks only works if you assume every socket.recv(n) (but the last) actually returns n bytes, which isn't guaranteed. os.stat(filename) only works (on most platforms) if you assume urlretrieve uses unbuffered files or flushes before every call, which again isn't guaranteed.
This is one of the many reasons not to use the "legacy interface".
The high-level interface (just calling urllib.request.urlopen and using the Response as a file object) may look like it provides less information than urlretrieve, but if you read the Restrictions section of the urllib.request documentation, it becomes pretty clear that this is an illusion. So you could just use urlopen, in which case you're copying from one file object to another instead of using a limited callback interface, and you can use any file-object-copying function you like, or write your own:
import urllib.request

def copy(fin, fout, flen=None):
    sofar = 0
    while True:
        buf = fin.read(8192)
        if not buf:
            break
        sofar += len(buf)
        if flen:
            print('{} / {} bytes'.format(sofar, flen))
        fout.write(buf)
    print('All done')

r = urllib.request.urlopen(url)
with open(path, 'wb') as f:
    copy(r, f, r.headers.get('Content-Length'))
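To get back to the original goal of a message at every 10%, the same copy loop can track percentage thresholds. Here's a sketch (the copy_with_progress name is mine), demonstrated with in-memory streams so it runs without a network connection; with a real response, flen would come from int(r.headers['Content-Length']) when that header is present:

```python
import io
import time

def copy_with_progress(fin, fout, flen, bufsize=8192):
    """Copy fin to fout, printing a line each time another 10% is done.

    Assumes flen (the total length in bytes) is known and positive.
    """
    sofar = 0
    next_mark = 10  # next percentage threshold to report
    while True:
        buf = fin.read(bufsize)
        if not buf:
            break
        fout.write(buf)
        sofar += len(buf)
        percent = sofar * 100 // flen
        while next_mark <= 100 and percent >= next_mark:
            print('{}% downloaded at {}'.format(next_mark, time.ctime()))
            next_mark += 10
    return sofar

# Demo with an in-memory "download" of 57843 bytes:
src = io.BytesIO(b'\0' * 57843)
dst = io.BytesIO()
copied = copy_with_progress(src, dst, 57843)
print(copied)  # 57843
```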
If you really want something that hooks into the lower-level guts of urllib, then urlretrieve is not that something; it just fakes it. You'll have to create your own opener subclass and the whole mess that goes with it.
If you want an interface that's almost as simple as urlopen but provides as much functionality as a custom opener… well, urllib doesn't have that, which is why third-party modules like requests exist.