
Get overall smallest elements' distribution in dataframe with sorted columns more efficiently

I have a dataframe with sorted columns, something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) for q in ['blue', 'green', 'red']})
       blue  green   red
    0 -2.15  -0.76 -2.62
    1 -0.88  -0.62 -1.65
    2 -0.77  -0.55 -1.51
    3 -0.73  -0.17 -1.14
    4 -0.06  -0.16 -0.75
    5 -0.03   0.05 -0.08
    6  0.06   0.38  0.37
    7  0.41   0.76  1.04
    8  0.56   0.89  1.16
    9  0.97   2.94  1.79

What I want to know is how many of the n smallest elements in the whole frame are in each column. This is the only thing I came up with:

is_small = df.isin(np.partition(df.values.flatten(), n)[:n])

With n=10, is_small looks like this:

        blue  green    red
    0   True   True   True
    1   True  False   True
    2   True  False   True
    3   True  False   True
    4  False  False   True
    5  False  False  False
    6  False  False  False
    7  False  False  False
    8  False  False  False
    9  False  False  False

Then by applying np.sum I get the number corresponding to each column.
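For reference, the whole approach described above can be put together as a small runnable sketch. The seed and the use of `default_rng` are assumptions added only to make the run reproducible; they are not part of the original data.

```python
import numpy as np
import pandas as pd

# Assumed seed, chosen only so that the example is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame({q: np.sort(rng.standard_normal(10).round(2))
                   for q in ['blue', 'green', 'red']})
n = 10

# The n smallest values of the whole frame (in no particular order)
smallest = np.partition(df.values.flatten(), n)[:n]

# Mark every cell whose value is among those n smallest values
is_small = df.isin(smallest)

# Column-wise sum of booleans = number of small elements per column
counts = is_small.sum()
print(counts)
```

Because the values are rounded, ties at the cutoff can make the counts add up to slightly more than n.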

I'm dissatisfied with this solution because it in no way utilizes the sortedness of the original data. All the data gets partitioned and all the data is then checked for whether it's in the partition. It seems wasteful, and I can't seem to figure out a better way.

asked Dec 20 '25 00:12 by MegaBluejay


2 Answers

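I think you can find the largest of the n smallest values with a partition, compare every column against it, and then use idxmin to get the position of the first value above that threshold in each sorted column. With the default 0-based index that position equals the count, as long as each column has at least one value above the threshold: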

# Find largest of n smallest numbers
N = (np.partition(df.values.flatten(), n)[:n]).max()
out = (df<=N).idxmin(axis=0)

Sample run:

In [152]: np.random.seed(0)

In [153]: df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) \
          for q in ['blue', 'green', 'red']})

In [154]: df
Out[154]: 
   blue  green   red
0 -0.98  -0.85 -2.55
1 -0.15  -0.21 -1.45
2 -0.10   0.12 -0.74
3  0.40   0.14 -0.19
4  0.41   0.31  0.05
5  0.95   0.33  0.65
6  0.98   0.44  0.86
7  1.76   0.76  1.47
8  1.87   1.45  1.53
9  2.24   1.49  2.27

In [198]: n = 5

In [199]: N = (np.partition(df.values.flatten(), n)[:n]).max()

In [200]: (df<=N).idxmin(axis=0)
Out[200]: 
blue     1
green    1
red      3
dtype: int64
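As a variation on the same threshold idea, here is a hedged sketch that replaces the full comparison with a binary search per column via `np.searchsorted`, which is only valid because each column is sorted ascending. The variable names `threshold` and `counts` are illustrative, not from the answer above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2))
                   for q in ['blue', 'green', 'red']})
n = 5

# Largest of the n smallest values: the element at sorted position n - 1
threshold = np.partition(df.values.ravel(), n - 1)[n - 1]

# Binary-search each sorted column for how many entries are <= threshold
counts = pd.Series(
    {col: int(np.searchsorted(df[col].to_numpy(), threshold, side='right'))
     for col in df.columns}
)
print(counts)
# blue     1
# green    1
# red      3
```

`searchsorted` returns the count directly, so it does not depend on the index labels and gives the column length when every entry in a column qualifies. The partition step still touches all the data; only the per-column counting benefits from the sort order.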
answered Dec 22 '25 13:12 by Divakar


Let's say you are looking for the 10 smallest values. You can stack the frame into a single Series, take its 10 smallest entries, and count how often each column name appears in their index:

df.stack().nsmallest(10).index.get_level_values(1).value_counts()

You get

red      5
blue     4
green    1
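Since the question asks specifically about exploiting the sort order, here is a rough alternative sketch that lazily merges the already-sorted columns with `heapq.merge` and stops after n elements. It uses the seeded frame from the sample run in the other answer so that the result can be checked; it is pure Python, so it may well be slower than the vectorized one-liner on small frames.

```python
import heapq
from collections import Counter
from itertools import islice, repeat

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2))
                   for q in ['blue', 'green', 'red']})
n = 5

# One ascending stream of (value, column_name) pairs per sorted column
streams = [zip(df[col].to_numpy(), repeat(col)) for col in df.columns]

# Lazily merge the sorted streams and consume only the first n pairs
merged = heapq.merge(*streams)
counts = Counter(col for _, col in islice(merged, n))
print(counts)
# Counter({'red': 3, 'blue': 1, 'green': 1})
```

The merge reads roughly n values plus one look-ahead per column, instead of scanning the whole frame. Ties between columns are broken by column name, because the pairs are compared as tuples.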
answered Dec 22 '25 13:12 by Vaishali


