What's the idiomatic way to fake __hash__() for dicts?

Question

EDIT: as @BrenBarn pointed out, the original didn't make sense.

Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.

# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]

# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)

# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}

# iv. something else entirely

In the spirit of the Zen of Python what's the "obvious way to do it"?

senderle · Accepted Answer

Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead I would suggest using frozenset. A frozenset is hashable and compares equal regardless of the order in which its elements were added, so it avoids the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that frozensets take up more memory than tuples, so there is a space/time tradeoff here. Either way, the values must themselves be hashable, which holds here since they are all str.

Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:

filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()

Or in Python 3, assuming you want a list rather than a dictionary view:

filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())

These are both a bit clunky. For readability, consider breaking it into two lines:

dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
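
As a rough end-to-end illustration in Python 3, here is the same idea run on a few made-up rows. The sample data is an assumption, shaped like csv.DictReader output. Note that with a dict comprehension the last duplicate wins, while the position of the first one is kept:

```python
# Hypothetical sample rows, shaped like csv.DictReader output (str keys and values).
pile_o_dicts = [
    {"name": "ada", "lang": "python"},
    {"lang": "python", "name": "ada"},  # same content, different key order
    {"name": "bob", "lang": "c"},
]

# A frozenset of (key, value) pairs is hashable and ignores key order,
# so equal dicts collapse onto the same dictionary key.
dict_o_dicts = {frozenset(d.items()): d for d in pile_o_dicts}
filtered = list(dict_o_dicts.values())

print(len(filtered))  # two distinct rows remain
```

If the first occurrence should be kept instead, dict.setdefault() in an explicit loop is one way to get that.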

The alternative is a sorted tuple of key-value tuples (again with Python 2's iteritems(); use items() in Python 3):

filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
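
A Python 3 sketch of the sorted-tuple variant, wrapped in a helper so it keeps the first occurrence of each duplicate. The name dedupe_sorted is made up for illustration, and the approach assumes the keys are mutually orderable (true for all-str keys) and the values hashable:

```python
def dedupe_sorted(dicts):
    """Drop duplicate dicts, keeping the first occurrence of each.

    Hypothetical helper: assumes keys can be sorted against each other
    and values are hashable.
    """
    seen = {}
    for d in dicts:
        # Sorting the items gives equal dicts the same tuple,
        # whatever order their keys were inserted in.
        key = tuple(sorted(d.items()))
        seen.setdefault(key, d)  # only stores d if key is new
    return list(seen.values())


rows = [{"a": "1", "b": "2"}, {"b": "2", "a": "1"}, {"a": "3", "b": "4"}]
unique_rows = dedupe_sorted(rows)
print(unique_rows)
```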

And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:

>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
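
For contrast, a quick check (assuming Python 3.7+, where dicts keep insertion order) that both canonical forms agree for the same two dictionaries even though their repr strings differ:

```python
d1 = {str(i): str(i) for i in range(300)}
d2 = {str(i): str(i) for i in range(299, -1, -1)}

# repr follows insertion order, so it differs between d1 and d2.
same_repr = repr(d1) == repr(d2)

# Both canonical keys depend only on the contents, not the order.
same_frozenset = frozenset(d1.items()) == frozenset(d2.items())
same_sorted = tuple(sorted(d1.items())) == tuple(sorted(d2.items()))

print(same_repr, same_frozenset, same_sorted)
```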

Raymond Hettinger · Answer

The dicts in the artfully named pile_o_dicts can be converted to a canonical form by sorting their item lists:

groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)

This will group identical dictionaries together.
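
To get from the groups back to a de-duplicated list, one option is to take the first member of each group. This is a sketch on made-up sample data, not part of the original snippet:

```python
# Hypothetical sample rows for illustration.
pile_o_dicts = [
    {"name": "ada", "lang": "python"},
    {"lang": "python", "name": "ada"},  # duplicate of the first row
    {"name": "bob", "lang": "c"},
]

groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)

# One representative per group, plus how often each one occurred.
unique = [members[0] for members in groups.values()]
counts = [len(members) for members in groups.values()]

print(unique)
print(counts)
```

Keeping the groups around also makes it easy to report which rows were duplicated, which a plain set would lose.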

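FWIW, at the time of writing, the technique of using sorted(d.items()) is used in the standard library by functools.lru_cache() to recognize function calls that have the same keyword arguments (more recent CPython releases no longer sort there, so a different keyword order may produce a separate cache entry). IOW, this technique is tried and true :-)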


Tags: python, python-3.x

everial
