I have a plain, unsorted list of numbers. From that list I need to take the first k elements after sorting. The thing is, if the list is considerably long and k is considerably small, sorting the entire list seems like a waste. I came up with an algorithmic solution for this, but it requires me to write my own sorting implementation. My question is: is there a way to get the same efficiency using something already implemented in Python?
UPDATE:
Just to clarify, I know this will give the answer I need: sorted(boring_list)[:k]
But my concern is efficiency: I shouldn't need to sort the whole list for this.
You can use the heapq module, in particular its nlargest or nsmallest functions.
Alternatively, just build the heap and call heappop() k times. Building the heap takes O(n) time, and retrieving the k elements takes O(k log n).
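A minimal sketch of both approaches (the list and k here are just illustrative):

```python
import heapq

boring_list = [9, 1, 8, 2, 7, 3]
k = 3

# Option 1: nsmallest returns the k smallest elements in sorted order.
print(heapq.nsmallest(k, boring_list))  # [1, 2, 3]

# Option 2: heapify in O(n), then pop k times for O(k log n) total.
heap = boring_list[:]  # copy so the original list is untouched
heapq.heapify(heap)
print([heapq.heappop(heap) for _ in range(k)])  # [1, 2, 3]
```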
Here's a very simple and small benchmark:
In [1]: import random, heapq
In [2]: seq = [random.randint(-5000, 5000) for _ in range(35000)]
In [3]: %timeit sorted(seq)[:75]
100 loops, best of 3: 14.5 ms per loop
In [4]: %%timeit
...: s = seq[:]
...: heapq.nsmallest(75, s)
...:
100 loops, best of 3: 4.05 ms per loop
In [5]: %%timeit
...: s = seq[:]
...: heapq.heapify(s)
...: for _ in range(75): heapq.heappop(s)
...:
100 loops, best of 3: 2.41 ms per loop
I have no idea why nsmallest is so much slower than calling heappop directly. In fact, I should have timed it without copying seq, but still:
In [6]: %%timeit
...: heapq.nsmallest(75, seq)
...:
100 loops, best of 3: 3.82 ms per loop
Increasing the length by 100 times:
In [12]: %timeit sorted(seq)[:75]
1 loops, best of 3: 1.9 s per loop
In [13]: %%timeit
...: heapq.nsmallest(75, seq)
...:
1 loops, best of 3: 352 ms per loop
In [14]: %%timeit
...: s = seq[:]
...: heapq.heapify(s)
...: for _ in range(75): heapq.heappop(s)
...:
1 loops, best of 3: 264 ms per loop
Note: to counter F.J's biased profiling:
In [13]: a = list(range(1000000))
In [14]: random.shuffle(a)
In [15]: %timeit sorted(a)
1 loops, best of 3: 985 ms per loop
In [16]: %%timeit
...: s = a[:]
...: heapq.heapify(s)
...:
1 loops, best of 3: 284 ms per loop
As you can see, heapify is quite a bit faster than sorting, even on lists of 1,000,000 elements.
Use heapq.nsmallest.
Maintaining the heap invariant is O(log k), where k is the size of the heap; you have to perform n push operations, making the overall complexity O(n log k). Compare this to sorting and taking the first k elements, which has overall complexity O(n log n). When k is small compared to n, the heapq approach clearly wins.
When k approaches n, you should instead just sort and take the first k - timsort is really good :-)
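That trade-off could be captured in a small helper; this is only a sketch, and the threshold below is an illustrative guess, not a measured crossover point:

```python
import heapq

def first_k_sorted(seq, k):
    """Return the k smallest elements of seq in sorted order."""
    # Use a heap when k is small relative to n; fall back to a full
    # timsort when k approaches n (the n/10 cutoff is arbitrary).
    if k < len(seq) // 10:
        return heapq.nsmallest(k, seq)
    return sorted(seq)[:k]
```

Either branch returns the same result; only the running time differs.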