I'm using Postgres as the database backend but I don't think that will matter. Also, I'd like to still use sqlite3
as the local development database, so ideally any approach will work for both.
By "weighted", I mean that some items in that database table are more likely to appear than others based on a heuristic value ranging from 0
to +inf
where 0
is "never going to be picked", 1
is "as equal to be picked as any other instance, and 2
is "twice as likely to be picked than any other instance".
I've read other SO posts about pulling random instances of models, but as far as I've seen there are no ways to do this quickly involving weights.
My model has a weight DecimalField, which can be updated at any time, even during runtime. What I'm after is an approach that is faster than the solutions I've tried below, or an explanation of why one of those solutions is the fastest I can get.
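For concreteness, here is a minimal sketch of what such a model might look like; the model name ContentItem and the title field are made up for illustration, and only the weight DecimalField comes from the question:

```python
from decimal import Decimal

from django.db import models


class ContentItem(models.Model):
    # Hypothetical name and fields; only the `weight` DecimalField is from the question.
    title = models.CharField(max_length=255)
    # 0 = never picked, 1 = baseline likelihood, 2 = twice the baseline, and so on.
    weight = models.DecimalField(max_digits=6, decimal_places=2, default=Decimal("1.0"))
```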
I want to select "fresh" stuff from a database table but still give the chance of seeing some of the older content. If some content has been viewed too frequently or it is not well-received, it should appear less frequently. Ideally, I would be able to control how frequently this is: "ah, so this will appear 1.5 times more than other content on the site."
First solution I tried: pick a random instance, then use 1 * weight / instances_count as the probability of keeping it; otherwise throw the item back and make another random choice. This seems pretty slow, and the random nature of the "throw back" means it might never terminate. In general this is really janky and I would hate to use it. I'm putting it first because it's pretty "simple" and I'm most likely to dismiss it.
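A rough sketch of that rejection loop, assuming the hypothetical ContentItem model above (the keep-probability is the one described, weight / instances_count, not a claim about the best formula):

```python
import random

from myapp.models import ContentItem  # hypothetical app/model from the sketch above


def pick_weighted_rejection():
    """Pick a uniform random row, keep it with probability weight / instances_count.

    The retry loop is unbounded and can run for a long time when weights are
    small compared to the number of rows.
    """
    count = ContentItem.objects.count()
    while True:
        # OFFSET into a random row; each attempt costs another query.
        candidate = ContentItem.objects.order_by("pk")[random.randrange(count)]
        if random.random() < float(candidate.weight) / count:
            return candidate
```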
Second solution: SELECT all of the rows, compute sum_of_all_weights, and roll a sum_of_all_weights-sided die to pick an instance by walking the cumulative weights. The problem is that this algorithm seems slow for millions of rows. This is the "naive" solution.
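A sketch of that naive version, again assuming the hypothetical ContentItem model: fetch every weight, roll once, and walk the cumulative total.

```python
import random

from myapp.models import ContentItem  # hypothetical, as above


def pick_weighted_naive():
    """Roll once in [0, sum_of_all_weights) and scan the cumulative sum until we pass it."""
    rows = list(ContentItem.objects.values_list("pk", "weight"))
    total = sum(float(weight) for _, weight in rows)
    roll = random.uniform(0, total)
    running = 0.0
    for pk, weight in rows:
        running += float(weight)
        if roll < running:
            return ContentItem.objects.get(pk=pk)
    return None  # only reached if the table is empty or all weights are zero
```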
Third solution: for each item, insert weight more instances with the same meta-information, so that a plain uniform random pick becomes weighted. A caveat with this is that only integer-based weightings are possible. Also, the performance problem is offloaded from the SELECT to the INSERT and DELETE operations.
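One way that duplication idea could be sketched, assuming an integer weight and a hypothetical helper that keeps the copy count in sync ("same meta-information" is approximated here by matching on the made-up title field):

```python
from myapp.models import ContentItem  # hypothetical, as above


def sync_copies(item, new_weight):
    """Keep exactly `new_weight` identical copies of `item` so a uniform pick is weighted.

    The cost moves from the SELECT to these INSERT/DELETE calls, and only
    integer weights are representable.
    """
    copies = ContentItem.objects.filter(title=item.title)
    current = copies.count()
    if current > new_weight:
        # Delete the surplus copies beyond the first `new_weight`.
        surplus = list(copies.order_by("pk").values_list("pk", flat=True)[new_weight:])
        ContentItem.objects.filter(pk__in=surplus).delete()
    else:
        for _ in range(new_weight - current):
            ContentItem.objects.create(title=item.title, weight=item.weight)
```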
Fourth solution: store a throw_back_probability field instead of weight. If it is 0.0, the instance will never be thrown back. Otherwise, roll a die and throw the instance back (picking again) depending on the throw_back_probability. This ultimately solves the problem, but it still probably requires more database calls and is an indirect solution.
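A sketch of that variant, assuming the hypothetical model stores a throw_back_probability DecimalField in place of weight:

```python
import random

from myapp.models import ContentItem  # hypothetical; assumes a throw_back_probability field


def pick_with_throw_back():
    """Pick a uniform random row, then re-roll while the throw-back die says to discard it."""
    count = ContentItem.objects.count()
    while True:
        candidate = ContentItem.objects.order_by("pk")[random.randrange(count)]
        # 0.0 means "never thrown back"; higher values make the item appear less often.
        if random.random() >= float(candidate.throw_back_probability):
            return candidate
```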
It's SO, so I'm sure there are clever annotation-based solutions to this (or something similar) that I'm overlooking. Thanks in advance for the help!
You can combine sharding with any of the approaches you list. Choose a number of shards (preferably with number of rows / number of shards significantly greater than log(number of rows) to avoid empty shards with high probability), assign each row a uniform random shard ID, and make the shard ID the first entry of the primary key so that the table is sorted by shard. To sample, choose a uniform random shard and then sample within the shard. This is inaccurate to the extent that the shard totals are unbalanced, but if the shards are large enough, then the law of large numbers will kick in. (If the shards are too large, though, then that starts to defeat the point of sharding.)
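A minimal sketch of how that sharding could sit on top of the naive cumulative-sum pick, assuming the hypothetical ContentItem model gains an indexed integer shard field (the field name, NUM_SHARDS value, and helpers are illustrative, not from the answer):

```python
import random

from myapp.models import ContentItem  # hypothetical; assumes an indexed integer `shard` field

NUM_SHARDS = 256  # keep rows-per-shard well above log(total rows) so shards are rarely empty


def assign_shard():
    """Uniform random shard ID, assigned to each row when it is inserted."""
    return random.randrange(NUM_SHARDS)


def pick_weighted_sharded():
    """Pick a uniform random shard, then do the weighted pick only within that shard."""
    shard = random.randrange(NUM_SHARDS)
    rows = list(ContentItem.objects.filter(shard=shard).values_list("pk", "weight"))
    if not rows:
        return None  # rare when shards are sized sensibly; retry with another shard in practice
    total = sum(float(weight) for _, weight in rows)
    roll = random.uniform(0, total)
    running = 0.0
    for pk, weight in rows:
        running += float(weight)
        if roll < running:
            return ContentItem.objects.get(pk=pk)
```

The accuracy caveat from the answer applies: the pick is only as balanced as the shard weight totals, so larger shards trade query speed for better accuracy.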