From my understanding, NumPy's percentile computes the qth percentile of the data.
But how exactly does it do that?
Say, given x = np.array([1.3, 1.7, 2.4, 2.8, 3.5, 5.6, 6.6, 7.7, 8.8, 9.9]) (10 floats inside).
If I do np.percentile(x, 100), it gives back 9.9000000000000004.
If I do np.percentile(x, 90), it should return 8.8, right? But it gives back 8.9100000000000001.
Why are there such differences? Are these differences acceptable?
Since version 1.9.0, NumPy's percentile function has an interpolation parameter, which is described in the docs like this:
interpolation : {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
- linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
- lower: i.
- higher: j.
- nearest: i or j whichever is nearest.
- midpoint: (i + j) / 2.
It defaults to linear. If you want to get 8.8 from your example, run:
np.percentile(x, 90, interpolation='lower')
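To see where 8.91 comes from, here is a minimal sketch that redoes the default linear interpolation by hand. The helper name linear_percentile is my own, not part of NumPy, and I am assuming the fractional index is (N - 1) * q / 100, which is consistent with the outputs in the question. It deliberately avoids the keyword argument, since as far as I know newer NumPy releases (1.22 and later) call it method rather than interpolation.

```python
import math

import numpy as np

x = np.array([1.3, 1.7, 2.4, 2.8, 3.5, 5.6, 6.6, 7.7, 8.8, 9.9])


def linear_percentile(values, q):
    """Hypothetical helper: q-th percentile using linear interpolation."""
    v = np.sort(np.asarray(values, dtype=float))
    pos = (len(v) - 1) * q / 100.0  # fractional index into the sorted data
    lo = math.floor(pos)            # index of the lower neighbour i
    hi = math.ceil(pos)             # index of the upper neighbour j
    fraction = pos - lo             # fractional part of the index
    return v[lo] + (v[hi] - v[lo]) * fraction


manual = linear_percentile(x, 90)   # index 8.1, between 8.8 and 9.9
builtin = np.percentile(x, 90)      # default linear behaviour
print(manual, builtin)
```

For q=90 the index is 9 * 0.9 = 8.1, so the result sits one tenth of the way from 8.8 to 9.9, i.e. 8.8 + 1.1 * 0.1 = 8.91.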
From my understanding, the 90th percentile does not have to be an item from the input array.
From the documentation:
Given a vector V of length N, the q-th percentile of V is the q-th ranked value in a sorted copy of V. The values and distances of the two nearest neighbors as well as the interpolation parameter will determine the percentile if the normalized ranking does not match q exactly. This function is the same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.
The issue with float representation (which is responsible for the slight difference in np.percentile(x, 100) compared to 9.9) is well known.
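A small sketch of the representation point, using only standard Python formatting: the value returned for q=100 is the same double as 9.9, it is just printed with 17 significant digits in the question.

```python
import numpy as np

x = np.array([1.3, 1.7, 2.4, 2.8, 3.5, 5.6, 6.6, 7.7, 8.8, 9.9])

# 9.9 has no exact binary representation; 17 significant digits expose that.
seventeen_digits = format(9.9, '.17g')
print(seventeen_digits)  # 9.9000000000000004

# Parsing the long form gives back exactly the same float as the short form.
same_float = float('9.9000000000000004') == 9.9
print(same_float)  # True

# The 100th percentile is the maximum of the data.
p100 = np.percentile(x, 100)
print(p100, x.max())
```

So the trailing digits are a display artifact, not an error introduced by percentile.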