Pandas DataFrame, get 3 max values in the row and their column names

Question

There are a lot of examples on the forum showing how to find the max value of a row together with the corresponding column name.

What I want to do is a specific modification of those examples. My dataframe looks like this, where all columns are numbered from left to right (this order is very important):

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
  0   0   1   2   2   0   0   0   0    0
  4   4   0   4   4   1   0   0   0    0
  0   0   1   2   3   0   0   0   0    0

Now I want to create 6 new columns at the end of every row, holding the column names and the values of the 3 biggest entries in the row.

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0
  4   4   0   4   4   1   0   0   0    0
  0   0   1   2   3   0   0   0   0    0

If a row has more than one max (for example the value 2 in the first row), I want to save in the column Max1 only the one column name with the smallest index. In this case the second biggest value is also 2, but the corresponding column has a greater index. That is, each "Max(y)" column must hold exactly one column name. This is the main condition. If a row has more than 3 max values, only the 3 column names with the smallest indices should be saved. So the final output should look like this DF:

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0  x_4       2  x_5       2  x_3       1
  4   4   0   4   4   1   0   0   0    0  x_1       4  x_2       4  x_4       4
  0   0   1   2   3   0   0   0   0    0  x_5       3  x_4       2  x_3       1

To summarize the result: in the first row 4 < 5, so column 4 comes first (and the second 2 follows immediately in the next column). In the second row 1 < 2 < 4 < 5, but we have only 3 slots, so column 5 is missing from the final result. In the third row the indices don't play any role, because the top values in the row are strictly different. This is also the main condition.
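The tie rule described above can be pinned down with a small pure-Python reference. This is only a sketch for checking expected results, not an efficient solution; the helper name `top_n_with_ties` is made up for illustration. It relies on the fact that Python's `sorted()` is stable, so equal values keep their left-to-right order.

```python
def top_n_with_ties(row, names, n=3):
    """Return [(name, value), ...] for the n largest values in `row`.

    sorted() is stable, so among equal values the one with the
    smallest column index comes first.
    """
    order = sorted(range(len(row)), key=lambda j: -row[j])
    return [(names[j], row[j]) for j in order[:n]]


names = [f"x_{i}" for i in range(1, 11)]
rows = [
    [0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
    [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
    [0, 0, 1, 2, 3, 0, 0, 0, 0, 0],
]

results = [top_n_with_ties(r, names) for r in rows]
for res in results:
    print(res)
```

Each printed list should match the Max/ValMax pairs of the corresponding row in the desired output.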

mozway · Accepted Answer

For an efficient approach you need to vectorize. For that, use NumPy's argpartition and indexing:

import numpy as np

N = 3

# convert to arrays
# and reverse to preserve order
# of min index in case of a tie
cols = df.columns.to_numpy()[::-1]
a = df.loc[:, ::-1].to_numpy()

# get the top N indices
idx = np.argpartition(a, -N)[:, :-N-1:-1]

# get the top names 
names = cols[idx]

# get the top values
values = np.take_along_axis(a, idx, axis=1)
# or
values = a[np.arange(len(a))[:,None], idx]

# assign to new columns
new_cols = [f'{x}{i+1}' for i in range(N) for x in ['Max', 'ValMax']]
df[new_cols] = np.dstack([names, values]).reshape(len(df), -1)

Output:

   x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8  x_9  x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0    0    0    1    2    2    0    0    0    0     0  x_4       2  x_5       2  x_3       1
1    4    4    0    4    4    1    0    0    0     0  x_1       4  x_2       4  x_4       4
2    0    0    1    2    3    0    0    0    0     0  x_5       3  x_4       2  x_3       1
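One caveat: np.argpartition only guarantees that the partition element is in its sorted position; the order of the elements inside each partition is undefined, so the ranking within the top N and the handling of ties are not guaranteed. As a hedged alternative (a different technique from the answer above, at the cost of a full sort per row), a stable argsort on the reversed columns gives a deterministic order. The variable names `order` and `out` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
     [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
     [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]],
    columns=[f"x_{i}" for i in range(1, 11)],
)
N = 3

# reverse the column order so that, after a stable ascending sort,
# the leftmost original column ends up last among equal values
cols = df.columns.to_numpy()[::-1]
a = df.to_numpy()[:, ::-1]

# stable ascending sort per row, then take the last N in reverse,
# i.e. the N largest values in descending order
order = np.argsort(a, axis=1, kind="stable")
idx = order[:, :-N-1:-1]

names = cols[idx]
values = np.take_along_axis(a, idx, axis=1)

# assign column by column so the value columns keep a numeric dtype
out = df.copy()
for i in range(N):
    out[f"Max{i+1}"] = names[:, i]
    out[f"ValMax{i+1}"] = values[:, i]

print(out)
```

Assigning the columns one at a time also avoids stacking names and values into a single object array.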

wjandrea · Answer

It seems to make more sense to do this in NumPy, then get the column names at the end.

I wrote a function you can use to get the top n indexes of an array. It works by using np.nanargmax then masking the values as NaN before doing it again. (There may be better ways to do this, but this is just what I thought of first.)

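import numpy as np

def argmax_n(arr: np.ndarray, n: int, axis=None):
    # work on a float copy so that entries can be masked with NaN
    arr = arr.astype('float')
    argmaxes = []
    for _ in range(n):
        # nanargmax ignores NaN and returns the first (smallest-index) maximum
        argmax = np.nanargmax(arr, axis=axis, keepdims=True)
        argmaxes.append(argmax)
        # mask the maximum just found so the next pass finds the runner-up
        np.put_along_axis(arr, argmax, np.nan, axis=axis)
    return argmaxes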

Used like so:

a = df.to_numpy()
argmax_3 = argmax_n(a, 3, axis=1)

Then you can build the DataFrame you want and .join it with the original if needed.

import pandas as pd

max_data = {}
for i, arg in enumerate(argmax_3, start=1):
    max_data[f'Max{i}'] = df.columns[arg.flatten()]
    max_data[f'ValMax{i}'] = np.take_along_axis(a, arg, axis=1).flatten()

pd.DataFrame(max_data)

  Max1  ValMax1 Max2  ValMax2 Max3  ValMax3
0  x_4        2  x_5        2  x_3        1
1  x_1        4  x_2        4  x_4        4
2  x_5        3  x_4        2  x_3        1
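The .join step mentioned above is not shown, so here is a minimal end-to-end sketch of it. It repeats the function in compact form to be self-contained, and it assumes a NumPy version whose nanargmax accepts keepdims, as the code above already does. Passing index=df.index when building the new frame is my own addition, so the join still aligns if the original index is not the default 0..n-1; the name `result` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
     [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
     [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]],
    columns=[f"x_{i}" for i in range(1, 11)],
)


def argmax_n(arr, n, axis=None):
    # float copy so the found maxima can be masked with NaN
    arr = arr.astype("float")
    argmaxes = []
    for _ in range(n):
        argmax = np.nanargmax(arr, axis=axis, keepdims=True)
        argmaxes.append(argmax)
        np.put_along_axis(arr, argmax, np.nan, axis=axis)
    return argmaxes


a = df.to_numpy()
max_data = {}
for i, arg in enumerate(argmax_n(a, 3, axis=1), start=1):
    max_data[f"Max{i}"] = df.columns[arg.ravel()].to_numpy()
    max_data[f"ValMax{i}"] = np.take_along_axis(a, arg, axis=1).ravel()

# build the new columns on the same index as df, then join on that index
result = df.join(pd.DataFrame(max_data, index=df.index))
print(result)
```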
