I have this piece of code:
import time

d = dict()
for i in range(200000):
    d[i] = "DUMMY"

start_time = time.time()
for i in range(200000):
    for key in d:
        if len(d) > 1 or -1 not in d:
            break
    del d[i]
print("--- {} seconds ---".format(time.time() - start_time))
Why does this take ~15 seconds to run?
But if I comment out del d[i] or the inner loop, it runs in ~0.1 seconds.
The issue you have is caused by iterating over even one element (e.g. next(iter(d))) of a dictionary that was once large but has since been shrunk a great deal. This can be nearly as slow as iterating over all of the dictionary items if you get unlucky with your hash values. And this code is very "unlucky" (predictably so, due to Python's hash design).
The reason for the issue is that Python does not rebuild a dictionary's hash table when you remove items. So the hash table for a dictionary that used to hold 200000 items, but which now has only 1 left, still has at least 200000 slots in it (and probably more, since the table is never kept entirely full).
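A quick way to see this (a minimal sketch; the exact byte counts are CPython-specific and vary by version) is to compare sys.getsizeof of the dictionary before and after the deletions:

import sys

d = {i: "DUMMY" for i in range(200000)}
print(sys.getsizeof(d))   # size of the fully populated dict

for i in range(199999):   # delete all but the last item
    del d[i]
print(sys.getsizeof(d))   # unchanged: the hash table was not shrunk

print(sys.getsizeof({199999: "DUMMY"}))  # a fresh one-item dict is far smaller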
When you iterate over the dictionary while it is full, finding the first item is quick: it will be in one of the first few table entries. But as you empty the table out, more and more blank spaces accumulate at its start, and the search for the first value that still exists takes longer and longer.
This might be even worse given that you're using integer keys, which (mostly) hash to themselves (only -1 hashes to something else: hash(-1) == -2). This means that the first key in the "full" dictionary will usually be 0, the next 1, and so on. As you delete the values in increasing order, you'll be very precisely removing the earliest keys in the table first, and so making the searches maximally worse.
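You can see the cost of finding the first key in isolation (a minimal sketch; absolute timings depend on your machine and Python version, but the gap is dramatic) by timing next(iter(d)) on a shrunken dictionary versus a fresh one holding the same single item:

import time

d = {i: "DUMMY" for i in range(200000)}
for i in range(199999):    # shrink the dict down to one item
    del d[i]

start = time.time()
next(iter(d))              # has to skip ~200000 emptied slots
print("shrunken dict: {:.6f}s".format(time.time() - start))

fresh = {199999: "DUMMY"}
start = time.time()
next(iter(fresh))          # the first occupied entry is found immediately
print("fresh dict:    {:.6f}s".format(time.time() - start))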
It's because this
for key in d:
    if len(d) > 1 or -1 not in d:
        break
will break on the first iteration, so your inner loop is basically a no-op.
Adding del d[i] makes it do some real work, which takes time.
Update: well, the above is obviously way too simplistic :-)
The following version of your code shows the same characteristic:
import time
import dis

n = 140000

def main(d):
    for i in range(n):
        del d[i]         # A
        for key in d:    # B
            break        # B

d = dict()
for i in range(n):
    d[i] = "DUMMY"

dis.dis(main)  # disassemble main, to check nothing unexpected is going on

start_time = time.time()
main(d)
print("--- {} seconds ---".format(time.time() - start_time))
Using iterkeys (Python 2) instead makes no difference.
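The measurements can be collected with something like this (a sketch of the benchmark, not the exact code used for the plot):

import time

def timed_run(n):
    # build a dict of n integer keys, then run the delete-and-peek loop
    d = {i: "DUMMY" for i in range(n)}
    start = time.time()
    for i in range(n):
        del d[i]         # A
        for key in d:    # B
            break        # B
    return time.time() - start

for n in range(20000, 200001, 20000):
    print(n, timed_run(n))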
If we plot the run time for different sizes of n, we get (n on the x-axis, seconds on the y-axis):

[plot: run time grows superlinearly with n]

so clearly something superlinear (it looks quadratic) is going on.
Deleting line (A) or lines (B) removes the superlinear component, although at first I wasn't sure why (Blckknght's answer explains it: without the deletions the table stays densely populated, and without the inner loop nothing ever scans it).
Update 2: Based on @Blckknght's answer, we can regain some of the speed by periodically rehashing the items into a fresh dictionary:
def main(d):
    for i in range(n):
        del d[i]
        if i % 5000 == 0:
            # rebuild the dict so its hash table shrinks back down
            d = {k: v for k, v in d.items()}
        for key in d:
            break
or this:
def main(d):
    for i in range(n):
        del d[i]
        if i % 6000 == 0:
            d = {k: v for k, v in d.items()}
        try:
            next(iter(d))  # peek at one key; raises StopIteration when empty
        except StopIteration:
            pass
takes under half the time of the original for large n (the bump at 130000 is consistent over 4 runs).

[plot: run time against n for the rebuilt-dict versions]
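Why does the periodic rebuild help? The dict comprehension allocates a brand-new dictionary sized for the items that actually remain, so finding the first key is cheap again, and the cost of the occasional full copy is amortized over the next few thousand iterations. For what it's worth (a variant of mine, not from the answers above), a plain copy rebuilds the hash table just as well:

if i % 5000 == 0:
    d = dict(d)  # copying also allocates a fresh, right-sized hash table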