I think it's easier to explain my problem with an example.
I have one table with ingredients for recipes and I have implemented a function to calculate the Tanimoto coefficient between ingredients. It's fast enough to calculate the coefficient between two ingredients (3 sql queries needed), but it does not scale well. To calculate the coefficient betweeen all possible ingredient's combination it needs N + (N*(N-1))/2 queries or 500500 queries for just 1k ingredients. Is there a faster way to do that? Here's what I got so far:
class Filtering():
  def __init__(self):
    self._connection=sqlite.connect('database.db')
  def n_recipes(self, ingredient_id):
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient
        where ingredient_id = ? ''', (ingredient_id, ))
    return cursor.fetchone()[0]
  def n_recipes_intersection(self, ingredient_a, ingredient_b):
    cursor = self._connection.cursor()
    cursor.execute('''select count(drink_id) from recipe_ingredient where
        ingredient_id = ? and recipe_id in (
        select recipe_id from recipe_ingredient
        where ingredient_id = ?) ''', (ingredient_a, ingredient_b))
    return cursor.fetchone()[0]
  def tanimoto(self, ingredient_a, ingredient_b):
    n_a, n_b = map(self.n_recipes, (ingredient_a, ingredient_b))
    n_ab = self.n_recipes_intersection(ingredient_a, ingredient_b)
    return float(n_ab) / (n_a + n_b - n_ab)
Why aren't you simply fetching all recipes into memory and then computing Tanimoto coefficients in memory?
It's simpler and it's much, much faster.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With