In Python 2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. Mockup code looks like this:
class PersistentDict(object):
    def __setitem__(self, key, value):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        self._store_to_bucket(bucket_index, key, value)

    def __getitem__(self, key):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        return self._fetch_from_bucket(bucket_index)[key]
In Python 3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:
dict/set)
I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh), which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.
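To make the problem concrete, here is a minimal sketch that starts two fresh interpreters with different PYTHONHASHSEED values and compares the built-in hash of a string with zlib.adler32 of the same bytes. The helper name run_snippet and the seed values 1 and 2 are purely illustrative.

```python
import os
import subprocess
import sys

# Snippet executed in a child interpreter: prints the salted built-in hash
# and an unsalted zlib checksum of the same string.
SNIPPET = "import zlib; print(hash('foobar'), zlib.adler32(b'foobar'))"


def run_snippet(seed):
    """Run SNIPPET in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    builtin_hash, checksum = out.decode().split()
    return int(builtin_hash), int(checksum)


first = run_snippet(1)
second = run_snippet(2)
print("hash() differs between runs:", first[0] != second[0])
print("adler32 is identical:", first[1] == second[1])
```

Under Python 3 the first comparison should show that the str hash depends on the seed, while the checksum does not.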
[1] Using hash() to identify buckets is either:
PersistentDicts were created with different salts

I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:
import datetime
import zlib

# Alias used below; assumed to refer to datetime.datetime (see the docstring).
datetime_type = datetime.datetime


def hashkey(obj, salt=0):
    """
    Create a key suitable for use in hashmaps

    :param obj: object for which to create a key
    :type obj: str, bytes, :py:class:`datetime.datetime`, object
    :param salt: an optional salt to add to the key value
    :type salt: int
    :return: numeric key to `obj`
    :rtype: int
    """
    if obj is None:
        return 0
    if isinstance(obj, str):
        return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif isinstance(obj, bytes):
        return zlib.adler32(obj, salt) & 0xffffffff
    elif isinstance(obj, datetime_type):
        return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    # Everything else falls back to the built-in hash (the salt is not applied here).
    return hash(obj) & 0xffffffff
With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead; for bytes it takes 0.8 usec and for str 0.7 usec.
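Numbers like these can be reproduced with timeit. The following is only a rough sketch: it uses a trimmed-down copy of hashkey (str and bytes branches only), and the helper usec_per_call is a hypothetical name. The lambda wrapper adds the same call overhead to both measurements, so the absolute values will be higher than the figures quoted here and will vary by machine and Python version.

```python
import timeit
import zlib


def hashkey(obj, salt=0):
    """Trimmed-down copy of the function above (str and bytes branches only)."""
    if isinstance(obj, str):
        return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif isinstance(obj, bytes):
        return zlib.adler32(obj, salt) & 0xffffffff
    return hash(obj) & 0xffffffff


def usec_per_call(func, arg, number=100000):
    """Average time of func(arg) in microseconds (includes lambda overhead)."""
    total = timeit.timeit(lambda: func(arg), number=number)
    return total / number * 1e6


samples = {"object": object(), "bytes": b"foobar", "str": "foobar"}
timings = {}
for name, value in samples.items():
    timings[name] = (usec_per_call(hash, value), usec_per_call(hashkey, value))
    print("%-6s hash: %.3f usec  hashkey: %.3f usec" % ((name,) + timings[name]))
```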
Overhead is roughly as follows:
hash(obj) vs def pyhash(obj): return hash(obj))
isinstance
zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~0.75 usec (adler and crc are +/- 4 usec)
obj.encode() of str objects ("foobar")
str(obj).encode() of datetime.datetime objects
Most of the optimization comes from the ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:
def hashkey_c(obj, salt=0):
    if obj.__class__ in hashkey_c.types:
        if obj is None:
            return 0
        if obj.__class__ is str:
            return zlib.adler32(obj.encode(), salt) & 0xffffffff
        elif obj.__class__ is bytes:
            return zlib.adler32(obj, salt) & 0xffffffff
        elif obj.__class__ is datetime_type:
            return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    return hash(obj) & 0xffffffff

hashkey_c.types = {str, bytes, datetime_type, type(None)}
Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map each type to its hash function is comparable, provided the membership check is done on a separate, explicit collection of the dict keys (aka types), i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types.
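The dict-based variant mentioned here is not shown in the answer, so the following is only one possible reading of it. The names hashkey_d, DISPATCH and DISPATCH_TYPES are made up for this sketch, and datetime_type is assumed to be datetime.datetime.

```python
import datetime
import zlib

datetime_type = datetime.datetime  # assumption: alias for datetime.datetime


def _hash_str(obj, salt):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff


def _hash_bytes(obj, salt):
    return zlib.adler32(obj, salt) & 0xffffffff


def _hash_datetime(obj, salt):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff


def _hash_none(obj, salt):
    return 0


# Maps an exact type to the function that hashes it.
DISPATCH = {
    str: _hash_str,
    bytes: _hash_bytes,
    datetime_type: _hash_datetime,
    type(None): _hash_none,
}
# Separate, explicit collection of the same types for the membership test.
DISPATCH_TYPES = frozenset(DISPATCH)


def hashkey_d(obj, salt=0):
    cls = obj.__class__
    if cls in DISPATCH_TYPES:
        return DISPATCH[cls](obj, salt)
    # Plain objects, ints, etc. fall through to the built-in hash.
    return hash(obj) & 0xffffffff


print(hashkey_d("foobar"), hashkey_d(b"foobar"), hashkey_d(None), hashkey_d(42))
```

As with hashkey_c, the lookup matches exact classes only, so instances of subclasses of str, bytes or datetime fall through to the built-in hash.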
Some additional notes:
hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
__hash__) containing a salted type, e.g. (1, 2, 'three')

A good alternative is xxhash. It provides fast integer hashes (as Python's hash), but they are consistent across multiple runs and machines:
import xxhash
print(xxhash.xxh32_intdigest('hash me please'))
print(xxhash.xxh64_intdigest('hash me please'))
print(xxhash.xxh128_intdigest('hash me please'))
obj = [0, 1, 2]
print(xxhash.xxh64_intdigest(str(obj)))
print(xxhash.xxh64_intdigest(str(obj), seed=0))
print(xxhash.xxh64_intdigest(str(obj), seed=42))
'''
Prints:
465606393
6686454294630346756
110192986912562192471431245034848549222
11514819435353980464
11514819435353980464
11772420285327955252
'''
Since it hashes strings, for arbitrary objects you may hash str(your_object) or repr(your_object). You can also set an optional seed parameter that defaults to zero, as shown in the example.
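One caveat with this approach is that the result is only as stable as str() or repr() of the object. The sketch below illustrates that; to stay within the standard library it swaps in zlib.crc32 for xxhash, and the names stable_key, bucket_index and Plain are hypothetical.

```python
import zlib


def stable_key(obj, seed=0):
    """Checksum of repr(obj); only as stable as repr(obj) itself."""
    return zlib.crc32(repr(obj).encode("utf-8"), seed) & 0xffffffff


def bucket_index(obj, bucket_count, seed=0):
    """Map obj to a bucket number in range(bucket_count)."""
    return stable_key(obj, seed) % bucket_count


class Plain(object):
    """No custom __repr__, so the default repr embeds the memory address."""


print(bucket_index([0, 1, 2], 16))
print(bucket_index((1, 2, "three"), 16))
# In CPython the default repr looks like <__main__.Plain object at 0x...>,
# so it changes between runs and is not usable as a persistent key.
print(repr(Plain()))
```

Lists and tuples of ints and strings have a deterministic repr, whereas objects without their own __repr__ would need a dedicated, stable string form before being hashed this way.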
Yet another alternative is pyhash. It works fine on Linux, but after some bad experiences installing it on Windows machines, I recommend xxhash, which works well on all platforms and was even faster than pyhash in some small examples that I tested.