When trying to flatten a list of lists using python 2.7's built-in sum function, I've ran across some performance issues - not only was the computation slow, but the iterative approach yielded much faster results.
The short code below seems to illustrate this performance gap:
import timeit
def sum1(arrs):
return sum(arrs, [])
def sum2(arrs):
s = []
for arr in arrs:
s += arr
return s
def main():
array_of_arrays = [[0] for _ in range(1000)]
print timeit.timeit(lambda: sum1(array_of_arrays), number=100)
print timeit.timeit(lambda: sum2(array_of_arrays), number=100)
if __name__=='__main__':
main()
On my laptop, I get as output:
>> 0.247241020203
>> 0.0043830871582
Could anyone explain to me why is it so?
Your sum2 uses +=:
for arr in arrs:
s += arr
sum does not use +=. sum is defined to use +. The difference is that s += arr is allowed to perform the operation by mutating the existing s list, while s = s + arr must construct a new list, copying the buffers of the old lists.
With +=, Python can use an efficient list resizing strategy that requires an amount of copying proportional to the size of the final list. For N lists of length K each, this takes time proportional to N*K.
With +, Python cannot do that. For every s = s + arr, Python must copy the entire s and arr lists to construct the new s. For N lists of size K each, the total time spent copying is proportional to N**2 * K, much worse.
Because of this, you should pretty much never use sum to concatenate sequences.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With