Pandas DataFrame, get 3 max values in the row and their column names

There are many examples on this forum showing how to find the max value of a row together with the corresponding column name.

What I want to do is a specific variation on those examples. My DataFrame looks like this, with all columns numbered from left to right (this order is very important):

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
  0   0   1   2   2   0   0   0   0    0
  4   4   0   4   4   1   0   0   0    0
  0   0   1   2   3   0   0   0   0    0
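
To make the example reproducible, the frame above could be built as follows. This is only a minimal sketch: the name `rows` is illustrative, and `df` is chosen to match the variable name used in the answers.

```python
import pandas as pd

# The three sample rows, in the same left-to-right column order as shown above
rows = [
    [0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
    [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
    [0, 0, 1, 2, 3, 0, 0, 0, 0, 0],
]

# Columns are named x_1 ... x_10
df = pd.DataFrame(rows, columns=[f"x_{i}" for i in range(1, 11)])
print(df)
```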

Now I want to add 6 new columns at the end of every row, holding the column names and the values of the three biggest entries in that row.

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0
  4   4   0   4   4   1   0   0   0    0
  0   0   1   2   3   0   0   0   0    0

If a row contains its max value more than once (for example, the value 2 in the first row), I want Max1 to hold only the column name with the smallest index. In this case the second biggest value is also 2, but its column has a greater index, so it goes into Max2. In other words, each "Max(y)" column must hold exactly one column name. This is the main condition. If a row has more than 3 tied max values, only the 3 column names with the smallest indices should be saved. So the final output should look like this DataFrame:

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0  x_4       2  x_5       2  x_3       1
  4   4   0   4   4   1   0   0   0    0  x_1       4  x_2       4  x_4       4
  0   0   1   2   3   0   0   0   0    0  x_5       3  x_4       2  x_3       1

To summarize: in the first row 4 < 5, so x_4 comes first (and the second 2 follows immediately in the next column). In the second row 1 < 2 < 4 < 5, but there are only 3 slots, so x_5 is missing from the final result. In the third row the indices play no role, because the three largest values are strictly different. This is also the main condition.

user495490 asked Oct 22 '25 12:10


2 Answers

For an efficient approach you need to vectorize; for that, use NumPy's argpartition and indexing:

import numpy as np

N = 3

# convert to arrays
# and reverse to preserve order
# of min index in case of a tie
cols = df.columns.to_numpy()[::-1]
a = df.loc[:, ::-1].to_numpy()

# get the top N indices
idx = np.argpartition(a, -N)[:, :-N-1:-1]

# get the top names 
names = cols[idx]

# get the top values
values = np.take_along_axis(a, idx, axis=1)
# or
values = a[np.arange(len(a))[:,None], idx]

# assign to new columns
df[[f'{x}{i+1}' for i in range(N) for x in ['Max', 'ValMax']]
  ] = (np.dstack([names,  values])
         .reshape(len(df), -1)
       )

Output:

   x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8  x_9  x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0    0    0    1    2    2    0    0    0    0     0  x_4       2  x_5       2  x_3       1
1    4    4    0    4    4    1    0    0    0     0  x_1       4  x_2       4  x_4       4
2    0    0    1    2    3    0    0    0    0     0  x_5       3  x_4       2  x_3       1
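
One caveat: NumPy documents the order of elements on either side of an argpartition split as undefined, so the top-N block is not guaranteed to come back sorted, or tie-broken by position, on every input. A minimal sketch of an alternative, assuming signed numeric data (so that negation is safe) and the sample frame from the question, uses a stable argsort instead. The names `order` and `out` are illustrative only.

```python
import numpy as np
import pandas as pd

N = 3
df = pd.DataFrame(
    [[0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
     [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
     [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]],
    columns=[f"x_{i}" for i in range(1, 11)],
)

a = df.to_numpy()
cols = df.columns.to_numpy()

# Sort descending by negating; kind="stable" keeps equal values in
# left-to-right order, so ties go to the smallest column index.
order = np.argsort(-a, axis=1, kind="stable")[:, :N]

names = cols[order]                            # (rows, N) column labels
values = np.take_along_axis(a, order, axis=1)  # (rows, N) matching values

# Assign column by column so the value columns keep their numeric dtype
out = df.copy()
for i in range(N):
    out[f"Max{i + 1}"] = names[:, i]
    out[f"ValMax{i + 1}"] = values[:, i]

print(out.iloc[:, -2 * N:])
```

A full sort costs more per row than a partition, so this trades some speed for a documented tie-breaking order.
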
mozway answered Oct 24 '25 10:10


It seems to make more sense to do this in NumPy, then get the column names at the end.

I wrote a function you can use to get the top n indexes of an array. It works by using np.nanargmax then masking the values as NaN before doing it again. (There may be better ways to do this, but this is just what I thought of first.)

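import numpy as np

def argmax_n(arr: np.ndarray, n: int, axis=None):
    # work on a float copy so entries can be masked with NaN
    arr = arr.astype('float')
    argmaxes = []
    for _ in range(n):
        # index of the current max; ties resolve to the first occurrence
        argmax = np.nanargmax(arr, axis=axis, keepdims=True)
        argmaxes.append(argmax)
        # mask the chosen entries so the next pass finds the next-largest
        np.put_along_axis(arr, argmax, np.nan, axis=axis)
    return argmaxes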
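
To see why the mask-and-repeat idea respects the "smallest index wins" rule, here is a minimal one-dimensional sketch. It assumes a single short row; the names `row` and `picked` are illustrative and not part of the function above.

```python
import numpy as np

# A shortened copy of the first sample row, as floats so NaN can be stored
row = np.array([0, 0, 1, 2, 2, 0], dtype=float)

picked = []
for _ in range(3):
    i = int(np.nanargmax(row))  # first position of the current max
    picked.append(i)
    row[i] = np.nan             # mask it so the next pass skips it

print(picked)  # [3, 4, 2]
```

The first pass picks position 3 (the leftmost 2), the second picks position 4 (the other 2), and the third picks position 2 (the value 1).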

Used like so:

a = df.to_numpy()
argmax_3 = argmax_n(a, 3, axis=1)

Then you can build the DataFrame you want and .join it with the original if needed.

import pandas as pd

max_data = {}
for i, arg in enumerate(argmax_3, start=1):
    max_data[f'Max{i}'] = df.columns[arg.flatten()]
    max_data[f'ValMax{i}'] = np.take_along_axis(a, arg, axis=1).flatten()

pd.DataFrame(max_data)
  Max1  ValMax1 Max2  ValMax2 Max3  ValMax3
0  x_4        2  x_5        2  x_3        1
1  x_1        4  x_2        4  x_4        4
2  x_5        3  x_4        2  x_3        1
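
The join step could look roughly like this. It is a sketch on a tiny hypothetical frame with a non-default index, where `max_df` stands in for the frame built from `max_data`. Because `join` aligns on the index, passing `index=df.index` when building the new frame keeps the rows matched if the original index is not 0, 1, 2, ...

```python
import pandas as pd

# Tiny stand-in for the original frame, with a non-default index
df = pd.DataFrame({"x_1": [0, 4], "x_2": [2, 1]}, index=["a", "b"])

# Hypothetical top-1 result; index=df.index keeps rows aligned for the join
max_df = pd.DataFrame(
    {"Max1": ["x_2", "x_1"], "ValMax1": [2, 4]},
    index=df.index,
)

# join matches rows by index label and appends the new columns
result = df.join(max_df)
print(result)
```
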
wjandrea answered Oct 24 '25 10:10
