Pandas cut function gives fewer categories than desired

I have a DataFrame that looks like this:

   var1  var2  var3  var4  var5  var6
0    0.3   0.6   0.7   0.8   0.7   0.5
1    0.7   0.6   0.4   0.6   0.7   1.0
2    0.0   0.0   0.0   0.0   0.0   0.0
3    0.1   0.9   0.5   0.7   0.7   0.9
4    0.3   2.3   0.4   2.0   1.9   1.9
5    4.0   1.2   0.6   1.2   2.6   3.1
6    0.0   0.0   0.0   0.0   0.0   0.0
7    0.0   0.2   0.1   0.2   0.2   0.2
8    0.1   0.1   0.1   0.1   0.1   0.1
9    0.0   0.0   0.0   0.0   0.0   0.0
10   0.1   0.1   0.1   0.2   0.1   0.1
11   0.0   0.0   0.0   0.0   0.0   0.1
12   0.0   0.0   0.0   0.0   0.0   0.0
13   0.0   0.0   0.0   0.0   0.0   0.0

I want to create 4 bins (strictly 4 bins) for every column, so I apply the pandas cut function to each column separately:

import pandas as pd
# so is the DataFrame shown above; cut each column into 4 equal-width bins
qt = so.apply(lambda x: pd.cut(x, 4))

Then if I do

qt.var1.unique()

I get

[(-0.004, 1.0], (3.0, 4.0]]
Categories (2, interval[float64]): [(-0.004, 1.0] < (3.0, 4.0]]

This has only 2 categories.

Any ideas why this happens?

quant asked Nov 20 '25 15:11


2 Answers

For var1, pd.cut splits the range of the column into equal-width bins. The values run from 0 to 4, so you get these intervals:

Categories (4, interval[float64]): [(-0.004, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]

unique() only shows 2 of them, because only 2 of the 4 intervals actually contain values.

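Explanation for -0.004, from the pandas documentation:

The range of x is extended by .1% on each side to include the minimum and maximum values of x.

Here the range of var1 is 4, and 0.1% of 4 is 0.004, so the left edge of the first bin becomes 0 - 0.004 = -0.004.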
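To check this yourself, here is a minimal sketch that rebuilds the var1 column from the question as a standalone Series (the names var1, binned, all_bins and counts are just mine, not from the question). It reads the categories directly instead of going through unique(), so the empty bins show up as well:

```python
import pandas as pd

# var1 column copied from the question's DataFrame
var1 = pd.Series([0.3, 0.7, 0.0, 0.1, 0.3, 4.0, 0.0,
                  0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0])

binned = pd.cut(var1, 4)

# The categorical keeps every bin as a category, including empty ones
all_bins = binned.cat.categories

# Count the values per bin; sort_index() puts the bins in interval order
counts = binned.value_counts().sort_index()

print(len(all_bins))    # 4
print(counts.tolist())  # [13, 0, 0, 1]
```

So the 4 bins are there; the two middle ones are simply empty for this column.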

luigigi answered Nov 23 '25 06:11


The documentation specifies that the bins have the same width:

Defines the number of equal-width bins in the range of x...

In your case, the data does not spread across 4 equal-width bins, so some of them stay empty. Here is an example:

>>> import numpy as np
>>> import pandas as pd

>>> a = np.arange(12)
>>> print(len(pd.cut(a, 4).unique()))
4

>>> b = np.array([1,2,3, 10, 20])
>>> print(len(pd.cut(b, 4).unique()))
3

As you can see, in the latter case 4 bins are still created, but only 3 of them are used.
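If you want to see which bin is left empty, the sketch below may help. It reuses the array b from the example and asks pd.cut for the computed edges with retbins=True; the variable names (binned, edges, n_bins, n_used, empty) are only illustrative:

```python
import numpy as np
import pandas as pd

b = np.array([1, 2, 3, 10, 20])

# retbins=True additionally returns the bin edges that pd.cut computed
binned, edges = pd.cut(b, 4, retbins=True)

n_bins = len(binned.categories)  # bins that were created
n_used = len(binned.unique())    # bins that actually received a value

# Wrap in a Series to count values per bin, then keep the bins with no values
counts = pd.Series(binned).value_counts()
empty = [interval for interval, n in counts.items() if n == 0]

print(n_bins, n_used)  # 4 3
print(empty)           # the single bin between 10.5 and 15.25
```

Four equal-width bins always need five edges, and here nothing falls between 10.5 and 15.25, which is why unique() reports one bin fewer.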

Andrea answered Nov 23 '25 05:11