Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dictionaries with numpy - Can I use XY coordinates as a hash? [duplicate]

Tags:

python

numpy

I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.

The array represents indicies, so while the actual identity of the object is not important, the value is. Mutabliity is not a concern, as I'm only interested in the current value.

What should I hash in order to store it in a dict?

My current approach is to use str(arr.data), which is faster than md5 in my testing.


I've incorporated some examples from the answers to get an idea of relative times:

In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop

In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop

In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop

In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop

In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop

It would appear that for this particular use case (small arrays of indicies), arr.tostring offers the best performance.

While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.

like image 794
sapi Avatar asked Jan 18 '26 04:01

sapi


1 Answers

You can simply hash the underlying buffer, if you make it read-only:

>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop

For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.

>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
like image 107
Fred Foo Avatar answered Jan 20 '26 18:01

Fred Foo