I have to find the mode of a NumPy array that I read from an hdf5 file. The NumPy array is 1d and contains floating point values.
my_array=f1[ds_name].value
mod_value=scipy.stats.mode(my_array)
The array contains around 1M values, and my script takes about 15 minutes to return the mode. Is there any way to make this faster?
Another question: why does scipy.stats.median(my_array) not work while mode does? It raises:
AttributeError: module 'scipy.stats' has no attribute 'median'
The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. The following simple implementation, for one-dimensional arrays only, is faster:
import numpy as np

def mode1(x):
    # Sorted unique values and the number of times each occurs.
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]
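To make the behaviour concrete, here is a minimal sketch that runs the same function on a tiny floating-point array. The function is repeated so the block runs on its own, and the sample values are made up for illustration. Note that counts.argmax() returns the first maximum, so when several values tie, the smallest one is reported.

```python
import numpy as np

def mode1(x):
    # np.unique returns the sorted distinct values and, with
    # return_counts=True, how many times each one occurs.
    values, counts = np.unique(x, return_counts=True)
    # argmax picks the first maximum, so ties go to the smallest value.
    m = counts.argmax()
    return values[m], counts[m]

# Made-up sample data, not taken from the question's HDF5 file.
sample = np.array([2.5, 1.0, 2.5, 3.0, 1.0, 2.5])
value, count = mode1(sample)
print(value, count)
```

With floating-point data, two values are counted together only if they are exactly equal. If the values in the file come from measurements or computation, rounding them first (for example with np.round) may be needed for the mode to be meaningful; whether that applies depends on the data.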
Here's an example. First, make an array of integers with length 1000000.
In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)
In [41]: x.shape
Out[41]: (1000000,)
Check that scipy.stats.mode and mode1 give the same result.
In [42]: from scipy.stats import mode
In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))
In [44]: mode1(x)
Out[44]: (1009, 1066)
Now check the performance.
In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is 2.91 seconds for mode(x) versus only 39.6 milliseconds for mode1(x), roughly 70 times faster.
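As for the second question, the AttributeError shows that the installed scipy.stats has no median function. NumPy provides one as np.median, which should work on the same 1-D array. A minimal sketch, using made-up values in place of the array read from the file:

```python
import numpy as np

# Made-up values standing in for the array read from the HDF5 file.
data = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# np.median sorts internally and returns the middle value.
med = np.median(data)
print(med)
```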