I have spent a good amount of time reading various answers about getting a random sample in Python, and random.sample seems to be the natural and most common choice. However, I am attempting to sample from a Python set object and was hoping to do it efficiently.
I am using a set because of Python's very nice and efficient set functionality (intersection, difference, etc.). For my purposes sets are a very efficient data structure, and lists specifically are not. I have an algorithmic situation where I have N elements in a set and potentially need to take up to N subsamples of arbitrary size. Each subsample is not drawn from exactly the same set; the set it is drawn from is defined by the properties of the element I am generating a subsample for. Here is some vague code which demonstrates the complexity of the algorithm:
main_set = set(...)  # Values sourced from elsewhere.
capacity = 20
for element in elements:  # Each element carries its own set of values
    potential_values = main_set - element.set   # Exclude values already in element
    sample_size = capacity - len(element.set)   # Num needed to fill the set to capacity
    new_vals = sample(potential_values, sample_size)  # <- insert sampling idea here
    element.set = element.set | set(new_vals)   # Union of sample and element set
From what I have gathered online and from some tests, random.sample appears to convert a set into a list object internally. The size of potential_values (i.e. main_set - element.set) is almost always much greater than the size of element.set, so if potential_values must be converted to a list on every sampling, the algorithm's performance will suffer immensely.
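Spelled out, the conversion I am worried about looks something like this (a rough, untested sketch reusing the placeholder names from the code above):

import random

# Explicitly materialize the set as a list so random.sample can index it;
# the list() call is the extra O(len(potential_values)) pass per iteration.
new_vals = random.sample(list(potential_values), sample_size)
element.set = element.set | set(new_vals)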
So does anyone have any advice or ideas on how to go about doing this efficiently with sets? I appreciate any input on this matter, and before anyone jumps to the 'premature optimization' routine: I have a very good idea of the scale at which this will be executing, and the difference between O(n) and O(n^2) is pretty substantial.
I specifically do not care about the output type of whatever sample() method is used. The actual samples I am pulling from potential_values are small compared to the size of potential_values. Rather, all of the suggested sample() methods require a list-like input to work, meaning potential_values must first be converted to an indexable type, which is what I wanted to avoid.
Also, I realize now that I brought up big-O notation in a really vague way and probably shouldn't have. When I said I wanted to avoid O(n^2), I really meant I wanted to avoid adding another O(n) operation inside the loop. As was pointed out to me, main_set - element.set has the same time complexity as list(main_set), so the algorithm is already O(n^2); adding the list conversion makes it more like O(2n^2), i.e. it doubles the constant rather than the exponent, but none of that is really important.
You can use heapq.nlargest, which accepts any iterable, and give it a random key to pick by, e.g.:
import random, heapq
sample = heapq.nlargest(sample_size, your_set, key=lambda L: random.random())
Note - this will give you a list object back, so you'll need to convert it back to a set if necessary...
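Roughly, plugging that into your loop would look something like this (untested, and reusing the names from the question's pseudocode):

import heapq
import random

for element in elements:
    potential_values = main_set - element.set
    sample_size = capacity - len(element.set)
    # "sample_size largest by a random key" amounts to a uniformly random subset
    new_vals = heapq.nlargest(sample_size, potential_values,
                              key=lambda e: random.random())
    element.set = element.set | set(new_vals)  # nlargest returns a list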
A quick attempt at timing in IPython suggests that using heapq.nlargest isn't necessarily better than your existing method, so test against the characteristics of your actual data and adjust as appropriate:
import random
import heapq

set_size = 100000
sample_size = 1000

def sample_heapq(your_set, sample_size):
    # Treat the sample as "the sample_size largest elements by a random key"
    sample = heapq.nlargest(sample_size, your_set, key=lambda e: random.random())
    return sample

def sample_original(your_set, sample_size):
    # Baseline: plain random.sample on the set
    sample = random.sample(your_set, sample_size)
    return sample

eg_set = set(range(sample_size))  # example population to draw from
Running these through timeit:
%timeit sample_heapq(eg_set, sample_size)
1000 loops, best of 3: 523 µs per loop
%timeit sample_original(eg_set, sample_size)
1000 loops, best of 3: 479 µs per loop
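If you want to gauge the cost of the explicit list conversion itself on your own data, a third variant along the same lines is easy to add to the comparison (sample_convert is just an illustrative name; no numbers given here, since they depend entirely on your set and sample sizes):

def sample_convert(your_set, sample_size):
    # Spell out the set -> list conversion before sampling; this is the
    # extra O(n) pass the question is trying to avoid.
    return random.sample(list(your_set), sample_size)

%timeit sample_convert(eg_set, sample_size)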