I have a dictionary which uses a 4-tuple as it's key. I need to find all the keys in the dictionary which partially match some other tuple. I have some code which does this but it's slow and needs optimizing.
Here's what I'm after:
Keys:
(1, 2, 3, 4)
(1, 3, 5, 2)
(2, 4, 8, 7)
(1, 4, 3, 4)
Match:
(1, None, 3, None)
Result:
[(1, 2, 3, 4), (1, 4, 3, 4)]
Current code:
def GetTuples(self, keyWords):
    tuples = []
    for k in self.chain.iterkeys():
        match = True
        for i in range(self.order):
            if keyWords[i] is not None and keyWords[i] != k[i]:
                match = False
                break
        if match is True:
            tuples.append(k)
    return tuples
I'm basically looking for either optimizations to this method, or a better way of storing this data.
You can't. Keys have to be unique.
No, each key in a dictionary should be unique. You can't have two keys with the same value. Attempting to use the same key again will just overwrite the previous value stored. If a key needs to store multiple values, then the value associated with the key should be a list or another dictionary.
Remove multiple keys from a dictionary using del Here, we are using a Python loop to iterate rem_list, while iterating if a key matches with the test_dict we delete it using the del keyword.
Creating a Dictionary To do that you separate the key-value pairs by a colon(“:”). The keys would need to be of an immutable type, i.e., data-types for which the keys cannot be changed at runtime such as int, string, tuple, etc. The values can be of any type.
You can't optimize this further if you store your data in a plain dictionary, since it does not provide anything faster then a sequential access to all elements of your dictionary in some unpredictable order. This means that your solution is not faster then O(n).
Now, databases. Database is not a universal solution to any (complex enough) problem. Can you reliably estimate the speed/complexity of such lookups for a database? If you scroll to the bottom of this reply, you will see that for large data sets database performance can be much worse than a smart data structure.
What you need here is a hand-crafted data structure. There is a number of choices, it strongly depends on other stuff you are doing with this data. For example: you can keep N sets of sorted lists of your keys, each sorted by n-th tuple element. Then you can quickly select N sorted sets of elements matching only one tuple element at position n, and find their intersection to get the results. This would give an average performance of O(log n)*O(m) where m is an average number of elements in one subset.
Or you can store you items in a k-d tree, this means that you have to pay O(log n) insertion price, but you can  do queries like the one above in O(log n) time. Here is an example in python, using k-d tree implementation from SciPy:
from scipy.spatial import kdtree
import itertools
import random
random.seed(1)
data = list(itertools.permutations(range(10), 4))
random.shuffle(data)
data = data[:(len(data)/2)]
tree = kdtree.KDTree(data)
def match(a, b):
    assert len(a) == len(b)
    for i, v in enumerate(a):
        if v != b[i] and (v is not None) and (b[i] is not None):
            return False
    return True
def find_like(kdtree, needle):
    assert len(needle) == kdtree.m
    def do_find(tree, needle):
        if hasattr(tree, 'idx'):
            return list(itertools.ifilter(lambda x: match(needle, x),
                                          kdtree.data[tree.idx]))
        if needle[tree.split_dim] is None:
            return do_find(tree.less, needle) + do_find(tree.greater, needle)
        if needle[tree.split_dim] <= tree.split:
            return do_find(tree.less, needle)
        else:
            return do_find(tree.greater, needle)
    return do_find(kdtree.tree, needle)
def find_like_bf(kdtree, needle):
    assert len(needle) == kdtree.m
    return list(itertools.ifilter(lambda x: match(needle, x),
                                  kdtree.data))
import timeit
print "k-d tree:"
print "%.2f sec" % timeit.timeit("find_like(tree, (1, None, 2, None))",
                                "from __main__ import find_like, tree",
                                number=1000)
print "brute force:"
print "%.2f sec" % timeit.timeit("find_like_bf(tree, (1, None, 2, None))",
                                "from __main__ import find_like_bf, tree",
                                number=1000)
And test run results:
$ python lookup.py
k-d tree:
0.89 sec
brute force:
6.92 sec
Just for fun, also added database-based solution benchmark. The initialization code changed from above to:
random.seed(1)
data = list(itertools.permutations(range(30), 4))
random.shuffle(data)
Now, the "database" implementation:
import sqlite3
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE a (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
db.execute("CREATE INDEX x1 ON a(x1)")
db.execute("CREATE INDEX x2 ON a(x2)")
db.execute("CREATE INDEX x3 ON a(x3)")
db.execute("CREATE INDEX x4 ON a(x4)")
db.executemany("INSERT INTO a VALUES (?, ?, ?, ?)",
               [[int(x) for x in value] for value in tree.data])
def db_test():
    cur = db.cursor()
    cur.execute("SELECT * FROM a WHERE x1=? AND x3=?", (1, 2))
    return cur.fetchall()
print "sqlite db:"
print "%.2f sec" % timeit.timeit("db_test()",
                                 "from __main__ import db_test",
                                 number=100)
And test results, reduced for 100 runs per benchmark (for resulting 657720-element set of keys):
$ python lookup.py
building tree
done in 6.97 sec
building db
done in 11.59 sec
k-d tree:
1.90 sec
sqlite db:
2.31 sec
It also worth mentioning that building tree took almost twice less time then inserting this test data set into the database.
Complete source here: https://gist.github.com/1261449
What about just using a database?
I prefer SQLite + SQLAlchemy even for simple projects, but plain sqlite3 might have a gentler learning curve.
Putting an index on each key column should take care of speed issues.
Perhaps you could speed it up by maintaining indexes for your keys. Essentially, something like this:
self.indices[2][5]
would contain a set of all of the keys which have 5 in the third position of the key.
Then you could simply do set intersection between the relevant index entries to get the set of keys:
matching_keys = None
for i in range(self.order):
    if keyWords[i] is not None:
        if matching_keys is None:
            matching_keys = self.indices[i][keyWords[i]]
        else:
            matching_keys &= self.indices[i][keyWords[i]]
matching_keys = list(matching_keys) if matching_keys else []
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With