Get overall smallest elements' distribution in dataframe with sorted columns more efficiently

Question

I have a dataframe with sorted columns, something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) for q in ['blue', 'green', 'red']})
       blue  green   red
    0 -2.15  -0.76 -2.62
    1 -0.88  -0.62 -1.65
    2 -0.77  -0.55 -1.51
    3 -0.73  -0.17 -1.14
    4 -0.06  -0.16 -0.75
    5 -0.03   0.05 -0.08
    6  0.06   0.38  0.37
    7  0.41   0.76  1.04
    8  0.56   0.89  1.16
    9  0.97   2.94  1.79

What I want to know is how many of the n smallest elements in the whole frame are in each column. This is the only thing I came up with:

is_small = df.isin(np.partition(df.values.flatten(), n)[:n])

With n=10, is_small looks like this:

        blue  green    red
    0   True   True   True
    1   True  False   True
    2   True  False   True
    3   True  False   True
    4  False  False   True
    5  False  False  False
    6  False  False  False
    7  False  False  False
    8  False  False  False
    9  False  False  False

Then by applying np.sum I get the number corresponding to each column.
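For reference, here is a self-contained sketch of the approach described above, run end to end. The frame is typed in from the example output shown earlier (the original was random, so literal values are used here), and the names `smallest` and `counts` are only illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the example frame printed in the question
df = pd.DataFrame({
    "blue":  [-2.15, -0.88, -0.77, -0.73, -0.06, -0.03, 0.06, 0.41, 0.56, 0.97],
    "green": [-0.76, -0.62, -0.55, -0.17, -0.16, 0.05, 0.38, 0.76, 0.89, 2.94],
    "red":   [-2.62, -1.65, -1.51, -1.14, -0.75, -0.08, 0.37, 1.04, 1.16, 1.79],
})
n = 10

# Partition the flattened data so that the first n entries are the n smallest
smallest = np.partition(df.to_numpy().ravel(), n)[:n]

# Mark every cell whose value is one of those n smallest values
is_small = df.isin(smallest)

# Column-wise sum of the boolean mask gives the count per column
counts = is_small.sum()
print(counts)
```

For this frame the counts are 4 for blue, 1 for green and 5 for red, matching the boolean table above.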

I'm dissatisfied with this solution because it in no way utilizes the sortedness of the original data. All the data gets partitioned and all the data is then checked for whether it's in the partition. It seems wasteful, and I can't seem to figure out a better way.

Divakar · Accepted Answer

I think you can find the largest of the n smallest values with a partition, compare every column against it, and then use idxmin to leverage the sortedness of each column:

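# Find the largest of the n smallest values in the whole frame
N = (np.partition(df.values.flatten(), n)[:n]).max()
# Each column is sorted, so the position of the first False is the number of
# values <= N (assumes the default 0..len-1 index and at least one value > N per column)
out = (df<=N).idxmin(axis=0)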

Sample run:

In [152]: np.random.seed(0)

In [153]: df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) \
          for q in ['blue', 'green', 'red']})

In [154]: df
Out[154]: 
   blue  green   red
0 -0.98  -0.85 -2.55
1 -0.15  -0.21 -1.45
2 -0.10   0.12 -0.74
3  0.40   0.14 -0.19
4  0.41   0.31  0.05
5  0.95   0.33  0.65
6  0.98   0.44  0.86
7  1.76   0.76  1.47
8  1.87   1.45  1.53
9  2.24   1.49  2.27

In [198]: n = 5

In [199]: N = (np.partition(df.values.flatten(), n)[:n]).max()

In [200]: (df<=N).idxmin(axis=0)
Out[200]: 
blue     1
green    1
red      3
dtype: int64
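One caveat with idxmin: if every value in a column is at or below N, the boolean column contains no False and idxmin returns 0 instead of the column length. Below is a minimal sketch of a variant that avoids this by running np.searchsorted on each sorted column. This is a substitute technique, not part of the answer above; the function name `count_n_smallest` is hypothetical, and the sketch assumes every column is sorted ascending with no NaNs.

```python
import numpy as np
import pandas as pd

def count_n_smallest(df, n):
    """Count, per column, the values <= the n-th smallest value of the frame.

    Assumes every column is sorted ascending and 1 <= n <= df.size.
    """
    values = df.to_numpy()
    # After partitioning at position n - 1, the element at index n - 1
    # is the n-th smallest value overall
    threshold = np.partition(values.ravel(), n - 1)[n - 1]
    # Binary search in each sorted column for the first value > threshold;
    # that position equals the number of values <= threshold
    counts = [np.searchsorted(values[:, j], threshold, side="right")
              for j in range(values.shape[1])]
    return pd.Series(counts, index=df.columns)

# Values copied from the sample run above
df = pd.DataFrame({
    "blue":  [-0.98, -0.15, -0.10, 0.40, 0.41, 0.95, 0.98, 1.76, 1.87, 2.24],
    "green": [-0.85, -0.21, 0.12, 0.14, 0.31, 0.33, 0.44, 0.76, 1.45, 1.49],
    "red":   [-2.55, -1.45, -0.74, -0.19, 0.05, 0.65, 0.86, 1.47, 1.53, 2.27],
})

result = count_n_smallest(df, 5)
everything = count_n_smallest(df, df.size)
print(result)
print(everything)
```

With n = 5 this gives the same counts as the sample run (1, 1, 3), and with n = df.size it reports 10 per column. If several values tie with the threshold, all of them are counted, so the counts can add up to more than n.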

Vaishali · Answer

Let's say you are looking for the 10 smallest values: you can stack the frame, take the 10 smallest entries, and use value_counts on the column level of their index

df.stack().nsmallest(10).index.get_level_values(1).value_counts()

You get

red      5
blue     4
green    1
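A runnable sketch of the one-liner above, with each step split out. The frame is typed in from the example output in the question, and the intermediate names (`stacked`, `smallest`, `counts`) are only illustrative.

```python
import pandas as pd

# Values copied from the example frame printed in the question
df = pd.DataFrame({
    "blue":  [-2.15, -0.88, -0.77, -0.73, -0.06, -0.03, 0.06, 0.41, 0.56, 0.97],
    "green": [-0.76, -0.62, -0.55, -0.17, -0.16, 0.05, 0.38, 0.76, 0.89, 2.94],
    "red":   [-2.62, -1.65, -1.51, -1.14, -0.75, -0.08, 0.37, 1.04, 1.16, 1.79],
})

# stack() turns the frame into one Series indexed by (row label, column name)
stacked = df.stack()

# Keep the 10 smallest values together with their (row, column) labels
smallest = stacked.nsmallest(10)

# Level 1 of the index holds the column names; count how often each appears
counts = smallest.index.get_level_values(1).value_counts()
print(counts)
```

Note that nsmallest selects from all stacked values, so this is a concise formulation rather than a way to exploit the sorted columns.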

Tags: python, python-3.x, pandas, numpy

MegaBluejay
