Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python function slowing down for no apparent reason

Tags:

python

I have a python function defined as follows which i use to delete from list1 the items which are already in list2. I am using python 2.6.2 on windows XP

def compareLists(list1, list2):
    curIndex = 0
    while curIndex < len(list1):
        if list1[curIndex] in list2:
            list1.pop(curIndex)
        else:
            curIndex += 1

Here, list1 and list2 are a list of lists

list1 = [ ['a', 11221, '2232'], ['b', 1321, '22342'] .. ]

# list2 has a similar format.

I tried this function with list1 with 38,000 elements and list2 with 150,000 elements. If i put in a print statement to print the current iteration, I find that the function slows down with each iterations. At first, it processes around 1000 or more items in a second and then after a while it reduces to around 20-50 a second. Why can that be happening?

EDIT: In the case with my data, the curIndex remains 0 or very close to 0 so the pop operation on list1 is almost always on the first item.

If possible, can someone also suggest a better way of doing the same thing in a different way?

like image 487
randomThought Avatar asked Apr 27 '26 11:04

randomThought


2 Answers

Try a more pythonic approach to the filtering, something like

[x for x in list1 if x not in set(list2)]

Converting both lists to sets is unnessescary, and will be very slow and memory hungry on large amounts of data.

Since your data is a list of lists, you need to do something in order to hash it. Try out

list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]

I tested out your original function, and my approach, using the following test data:

list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]

Timings - not scientific, but still:

 #Original function
 real    2m16.780s
 user    2m16.744s
 sys     0m0.017s

 #My function
 real    0m0.433s
 user    0m0.423s
 sys     0m0.007s
like image 101
gnud Avatar answered Apr 29 '26 00:04

gnud


There are 2 issues that cause your algorithm to scale poorly:

  1. x in list is an O(n) operation.
  2. pop(n) where n is in the middle of the array is an O(n) operation.

Both situations cause it to scale poorly O(n^2) for large amounts of data. gnud's implementation would probably be the best solution since it solves both problems without changing the order of elements or removing potential duplicates.

like image 43
fengb Avatar answered Apr 29 '26 01:04

fengb



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!