I have binned my data using NumPy's histogram and digitize functions:
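import numpy as np
_, bins = np.histogram(x, bins=bins)  # keep only the bin edges
arr = np.digitize(x, bins) - 1  # zero-based bin index of each observation
x = bins[arr]  # map each observation to the edge at its bin index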
Or possibly:
x = pandas.cut(x, bins=bins)
However, as the distribution is very skewed, even after removing outliers, there are many bins with very few observations. I want to merge bins, somewhat similar to:
How to merge bins in R
The procedure would possibly involve pandas groupby and then merging groups with fewer than n observations into their neighbouring bins. Is there a way to achieve this in pandas/numpy?
As promised, I implemented something in physt, version 0.3.5. You're welcome to use it.
See http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#Merging-bins and particularly http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#By-min-frequency
In your case, the workflow would be something like this:
import physt
histogram = physt.h1(x, bins=bins)
histogram.merge_bins(min_frequency=n)
bins = histogram.numpy_bins
Note that the code is in alpha stage, and not every bin is guaranteed to contain more than the required minimum (in order to preserve tall narrow bins). I am still looking for the best algorithm.
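If you would rather stay with plain NumPy, here is a minimal sketch of one possible approach. It is not the algorithm physt uses: it simply walks the bins from left to right and closes a merged bin as soon as the running count reaches the minimum. The function name merge_sparse_bins and its parameters are hypothetical, made up for this illustration. Unlike physt, it does not try to preserve tall narrow bins: a tall bin that follows a run of sparse bins gets absorbed together with them.

```python
import numpy as np

def merge_sparse_bins(counts, edges, min_count):
    """Greedy left-to-right merge (hypothetical helper, not part of numpy or physt).

    Adjacent bins are accumulated until the running count reaches min_count;
    any leftover bins at the right end are folded into the last merged bin.
    """
    merged_counts, merged_edges = [], [edges[0]]
    running = 0
    open_bin = False  # True while bins have been consumed but not yet closed
    for count, right_edge in zip(counts, edges[1:]):
        running += count
        open_bin = True
        if running >= min_count:
            merged_counts.append(running)
            merged_edges.append(right_edge)
            running = 0
            open_bin = False
    if open_bin:
        if merged_counts:
            # Fold the sparse tail into the previous merged bin
            merged_counts[-1] += running
            merged_edges[-1] = edges[-1]
        else:
            # Everything together is still below min_count: keep a single bin
            merged_counts.append(running)
            merged_edges.append(edges[-1])
    return np.array(merged_counts), np.array(merged_edges)

# Small deterministic example: six unit-width bins, minimum of 3 per bin
counts = np.array([5, 1, 0, 2, 7, 1])
edges = np.arange(7.0)
merged_counts, merged_edges = merge_sparse_bins(counts, edges, min_count=3)
```

With real data you would take counts and edges from np.histogram(x, bins=bins) and then pass the merged edges on to np.digitize or pandas.cut. Merging in the other direction, or towards the smaller neighbour, would give different bins, so treat this as a starting point only.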