 

Randomly Sample a Python Set without converting to a list

Tags:

python

random

set

Question

I have spent a good amount of time reading various answers about getting a random sample in Python, and random.sample seems to be the natural and most common choice. However, I am attempting to sample from a Python set object and was hoping to do it efficiently.

I am using a set because of Python's very nice and efficient set operations (intersection, difference, etc.). For my purposes sets are a very efficient data structure, and lists specifically are not. I have an algorithmic situation where I have N elements in a set and potentially need to take up to N subsamples, each of arbitrary size. Each subsample is not drawn from exactly the same set; it is defined by the properties of the element I must generate a subsample for. Here is some rough code that demonstrates the shape of the algorithm:

main_set = set(...)  # Values sourced from elsewhere.
capacity = 20

for element in elements:  # Iterate over the elements, not the built-in `list`
    potential_values = main_set - element.set  # Exclude values already in element
    sample_size = capacity - len(element.set)  # Num needed to fill the set to capacity
    new_vals = sample(potential_values, sample_size)  # <- insert sampling idea here

    element.set |= set(new_vals)  # Union of the sample and the element's set

From what I have gathered online and from some tests, random.sample appears to convert a set into a list object. The size of potential_values (main_set - element.set) is almost always much greater than the size of element.set, so if potential_values must be converted to a list on every iteration, performance will suffer immensely.

So does anyone have advice or ideas on how to do this efficiently with sets? I appreciate any input on this matter, and before anyone jumps to the 'premature optimization' routine: I have a very good idea of the scale at which this will be executing, and the difference between O(n) and O(n^2) is pretty substantial.


Clarification Edit:

I specifically do not care about the output type of whatever sample() method is suggested. The actual samples I am pulling from potential_values are small compared to the size of potential_values. Rather, all of the suggested sample() methods require a list-like input to work, meaning potential_values must first be converted to an indexable type, which is what I wanted to avoid.

Also, I realize now that I brought up big-O notation in a really vague way and probably shouldn't have. When I said I wanted to avoid O(n^2), I really meant I wanted to avoid adding another O(n) operation inside the loop. As was pointed out to me, main_set - element.set has the same time complexity as list(main_set), so the algorithm is already O(n^2); adding the list conversion makes it more like O(2n^2), but none of that is really important.
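
One way to sidestep the per-iteration conversion entirely is rejection sampling against a list built once, outside the loop. This is a minimal sketch, assuming (as the question states) that element.set stays small relative to main_set; the sample_excluding helper and the elements name are hypothetical, not part of the original code:

import random

def sample_excluding(population_list, exclude, k, rng=random):
    # Draw k distinct elements of population_list that are not in `exclude`,
    # by rejection sampling. Assumes k never exceeds the number of eligible
    # elements; otherwise this loops forever.
    chosen = set()
    while len(chosen) < k:
        candidate = rng.choice(population_list)
        if candidate not in exclude and candidate not in chosen:
            chosen.add(candidate)
    return chosen

main_list = list(main_set)  # O(n) conversion done once, not once per element

for element in elements:
    sample_size = capacity - len(element.set)
    element.set |= sample_excluding(main_list, element.set, sample_size)

Because rejections are rare when the excluded set is small, each subsample costs roughly O(k) rather than O(n).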

asked by mgallagher


2 Answers

You can use heapq.nlargest, which accepts any iterable, and give it a random key to pick by, e.g.:

import random, heapq

# each element gets an independent random key, so the k "largest" form a uniform sample
sample = heapq.nlargest(sample_size, your_set, key=lambda _: random.random())

Note - this will give you a list object back, so you'll need to convert it back to a set if necessary...
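
For instance, a sketch of how this could slot into the question's loop (using the main_set, element, and capacity names from the question; this is an illustration, not part of the original answer):

import heapq
import random

potential_values = main_set - element.set
sample_size = capacity - len(element.set)
new_vals = heapq.nlargest(sample_size, potential_values,
                          key=lambda _: random.random())
element.set |= set(new_vals)  # nlargest returns a list, so convert before the union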

answered by Jon Clements


A quick attempt at timing in IPython suggests that heapq.nlargest isn't necessarily faster than your existing method; adjust for the characteristics of your actual data as appropriate:

import random
import heapq

set_size = 100000
sample_size = 1000

def sample_heapq(your_set, sample_size):
    # tag each element with a random key and keep the sample_size "largest"
    return heapq.nlargest(sample_size, your_set, key=lambda e: random.random())

def sample_original(your_set, sample_size):
    # random.sample does the set-to-sequence conversion internally
    return random.sample(your_set, sample_size)

# note: as written, the benchmark set has sample_size (1000) elements; set_size is unused
eg_set = set(range(sample_size))

Running these through timeit:

%timeit sample_heapq(eg_set, sample_size)
1000 loops, best of 3: 523 µs per loop

%timeit sample_original(eg_set, sample_size)
1000 loops, best of 3: 479 µs per loop
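
One caveat: random.sample stopped accepting sets directly in newer Python versions (deprecated in 3.9, removed in 3.11), so on current interpreters the benchmark's sample_original needs the conversion made explicit:

import random

def sample_original(your_set, sample_size):
    # on Python >= 3.11, random.sample requires a sequence, not a set
    return random.sample(list(your_set), sample_size)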
answered by Marius


