I have a plain, unsorted list of numbers. From that list I need to take the first k elements after sorting. The thing is, if the list is considerably long and k is considerably small, sorting the entire list seems like a waste. I came up with an algorithmic solution for this, but it requires me to write my own sorting implementation. My question is: is there a way to get the same efficiency using something already implemented in Python?
UPDATE:
Just to clarify, I know this will give the answer I need: sorted(boring_list)[:k]
But my concern is efficiency: I shouldn't need to sort the whole list for this.
You can use the heapq module, in particular its nlargest or nsmallest functions.
Alternatively, just build the heap and call heappop() k times. Building the heap takes O(n) time, and retrieving the k elements takes O(k log n).
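A minimal sketch of both approaches (the list and k here are just illustrative):

```python
import heapq

boring_list = [9, 1, 8, 2, 7, 3]
k = 3

# Option 1: nsmallest returns the k smallest elements in sorted order.
print(heapq.nsmallest(k, boring_list))  # [1, 2, 3]

# Option 2: heapify in O(n), then pop k times for O(k log n) total.
heap = boring_list[:]  # copy so the original list is untouched
heapq.heapify(heap)
print([heapq.heappop(heap) for _ in range(k)])  # [1, 2, 3]
```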
Here's a very simple and small benchmark:
In [1]: import random, heapq
In [2]: seq = [random.randint(-5000, 5000) for _ in range(35000)]
In [3]: %timeit sorted(seq)[:75]
100 loops, best of 3: 14.5 ms per loop
In [4]: %%timeit
...: s = seq[:]
...: heapq.nsmallest(75, s)
...:
100 loops, best of 3: 4.05 ms per loop
In [5]: %%timeit
...: s = seq[:]
...: heapq.heapify(s)
...: for _ in range(75): heapq.heappop(s)
...:
100 loops, best of 3: 2.41 ms per loop
I have no idea why nsmallest is so much slower than calling heappop directly. In fact, I should have timed it without copying seq, but still:
In [6]: %%timeit
...: heapq.nsmallest(75, seq)
...:
100 loops, best of 3: 3.82 ms per loop
Increasing the length by 100 times:
In [12]: %timeit sorted(seq)[:75]
1 loops, best of 3: 1.9 s per loop
In [13]: %%timeit
...: heapq.nsmallest(75, seq)
...:
1 loops, best of 3: 352 ms per loop
In [14]: %%timeit
...: s = seq[:]
...: heapq.heapify(s)
...: for _ in range(75): heapq.heappop(s)
...:
1 loops, best of 3: 264 ms per loop
Note: to counter F.J's biased profiling:
In [13]: a = list(range(1000000))
In [14]: random.shuffle(a)
In [15]: %timeit sorted(a)
1 loops, best of 3: 985 ms per loop
In [16]: %%timeit
...: s = a[:]
...: heapq.heapify(s)
...:
1 loops, best of 3: 284 ms per loop
As you can see, heapify is quite a bit faster than sorting, even on lists of 1,000,000 elements.
Use heapq.nsmallest.
Maintaining the heap invariant is O(log k), where k is the size of the heap; you have to perform n push operations, making the overall complexity O(n log k). Compare this to sorting and taking the first k elements, which has overall complexity O(n log n). When k is small compared to n, the heapq approach clearly wins.
When k approaches n, you should instead just sort and take the first k - timsort is really good :-)
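That trade-off could be captured in a small helper; this is only a sketch, and the threshold below is an illustrative guess, not a measured crossover point:

```python
import heapq

def first_k_sorted(seq, k):
    """Return the k smallest elements of seq in sorted order."""
    # Use a heap when k is small relative to n; fall back to a full
    # timsort when k approaches n (the n/10 cutoff is arbitrary).
    if k < len(seq) // 10:
        return heapq.nsmallest(k, seq)
    return sorted(seq)[:k]
```

Either branch returns the same result; only the running time differs.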