Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas dataframe finding largest N elements of each row with row-specific N

I have a DataFrame:

>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0  22.0  23.0  24.0  25.0

and a Series that specifies the number of top N values I want from each row

>>> n_max = pd.Series([2,3,4])

What is Panda's way of using df and n_max to find the largest N elements of each (breaking ties with a random pick, just as .nlargest() would do)?

The desired output is

         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0

I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:

>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0

But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...

like image 780
Zhang18 Avatar asked Oct 28 '25 01:10

Zhang18


1 Answers

Based on @ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:

>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0
like image 116
Zhang18 Avatar answered Oct 31 '25 00:10

Zhang18