There are a lot of examples on the forum showing how to find the max value of a row together with the corresponding column name. Some examples are here or here
What I want to do is a specific modification of the above examples. My dataframe looks like this, where all columns are numbered from left to right (this order is very important):
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
0 0 1 2 2 0 0 0 0 0
4 4 0 4 4 1 0 0 0 0
0 0 1 2 3 0 0 0 0 0
Now I want to create 6 new columns at the end of every row, holding the column names and the values of the three biggest entries in the row.
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0 0 1 2 2 0 0 0 0 0
4 4 0 4 4 1 0 0 0 0
0 0 1 2 3 0 0 0 0 0
If a row contains its max value more than once (for example the value 2 in the first row), I want to save in the column Max1 only the one column name with the smallest index. In this case the second biggest value is also 2, but the corresponding column has a greater index, so it goes into Max2. In other words, each "Max(y)" column must hold exactly one column name. This is the main condition. If a row contains its max value more than 3 times, only the 3 column names with the smallest indices should be saved. So the final output should look like this DF:
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0 0 1 2 2 0 0 0 0 0 x_4 2 x_5 2 x_3 1
4 4 0 4 4 1 0 0 0 0 x_1 4 x_2 4 x_4 4
0 0 1 2 3 0 0 0 0 0 x_5 3 x_4 2 x_3 1
To summarize the result: in the first row 4 < 5, so column 4 comes first (and the second 2 follows immediately in the next column). In the second row 1 < 2 < 4 < 5, but we only have 3 columns, so column 5 is missing from the final result. In the third row the indices don't play any role, because the top values in the row are strictly different. This is also the main condition.
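For anyone who wants to reproduce this, the sample frame above can be built roughly like this. It is a minimal sketch: the values are copied from the table in the question, and the default integer index is an assumption, since the question does not show one.

```python
import pandas as pd

# Column names x_1 ... x_10, in left-to-right order as in the question
columns = [f"x_{i}" for i in range(1, 11)]

# Sample rows copied from the question
df = pd.DataFrame(
    [
        [0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
        [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
        [0, 0, 1, 2, 3, 0, 0, 0, 0, 0],
    ],
    columns=columns,
)
print(df)
```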
For an efficient approach you need to vectorize, for that use numpy's argpartition and indexing:
import numpy as np
N = 3
# convert to arrays
# and reverse to preserve order
# of min index in case of a tie
cols = df.columns.to_numpy()[::-1]
a = df.loc[:, ::-1].to_numpy()
# get the top N indices
idx = np.argpartition(a, -N)[:, :-N-1:-1]
# get the top names
names = cols[idx]
# get the top values
values = np.take_along_axis(a, idx, axis=1)
# or
values = a[np.arange(len(a))[:,None], idx]
# assign to new columns
df[[f'{x}{i+1}' for i in range(N) for x in ['Max', 'ValMax']]
   ] = (np.dstack([names, values])
          .reshape(len(df), -1)
        )
Output:
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0 0 0 1 2 2 0 0 0 0 0 x_4 2 x_5 2 x_3 1
1 4 4 0 4 4 1 0 0 0 0 x_1 4 x_2 4 x_4 4
2 0 0 1 2 3 0 0 0 0 0 x_5 3 x_4 2 x_3 1
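One caveat, as far as I can tell from the NumPy documentation: np.argpartition leaves the order of elements inside each partition undefined, so the ordering and tie-breaking above are not strictly guaranteed on other data. If a deterministic "smallest column index wins" rule matters more than raw speed, a stable argsort on the negated values is one possible alternative. This is only a sketch, assuming a signed numeric dtype (negation would misbehave on unsigned integers); the name `out` and the rebuilt sample frame are just for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame(
    [
        [0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
        [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
        [0, 0, 1, 2, 3, 0, 0, 0, 0, 0],
    ],
    columns=[f"x_{i}" for i in range(1, 11)],
)

N = 3
a = df.to_numpy()

# Sorting the negated values ascending gives the original values descending;
# kind="stable" keeps equal values in their left-to-right column order.
idx = np.argsort(-a, axis=1, kind="stable")[:, :N]

names = df.columns.to_numpy()[idx]           # shape (rows, N)
values = np.take_along_axis(a, idx, axis=1)  # shape (rows, N)

out = df.copy()
for i in range(N):
    out[f"Max{i+1}"] = names[:, i]
    out[f"ValMax{i+1}"] = values[:, i]

print(out)
```

This sorts every row fully, so it does more work than argpartition on wide frames, but the tie-breaking no longer depends on partition internals.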
It seems to make more sense to do this in NumPy, then get the column names at the end.
I wrote a function you can use to get the top n indexes of an array. It works by using np.nanargmax then masking the values as NaN before doing it again. (There may be better ways to do this, but this is just what I thought of first.)
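import numpy as np

def argmax_n(arr: np.ndarray, n: int, axis=None):
    # work on a float copy so that NaN can be used as a mask
    arr = arr.astype('float')
    argmaxes = []
    for _ in range(n):
        # nanargmax returns the first occurrence, so ties go to the smallest index
        argmax = np.nanargmax(arr, axis=axis, keepdims=True)
        argmaxes.append(argmax)
        # mask the maxima just found so the next pass picks the next-largest
        np.put_along_axis(arr, argmax, np.nan, axis=axis)
    return argmaxes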
Used like so:
a = df.to_numpy()
argmax_3 = argmax_n(a, 3, axis=1)
Then you can build the DataFrame you want and .join it with the original if needed.
import pandas as pd

max_data = {}
for i, arg in enumerate(argmax_3, start=1):
    max_data[f'Max{i}'] = df.columns[arg.flatten()]
    max_data[f'ValMax{i}'] = np.take_along_axis(a, arg, axis=1).flatten()
pd.DataFrame(max_data)
Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0 x_4 2 x_5 2 x_3 1
1 x_1 4 x_2 4 x_4 4
2 x_5 3 x_4 2 x_3 1
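To get the layout asked for in the question, the pieces above can be put together and joined back onto the original frame. The following is a rough end-to-end sketch rather than a definitive version: it assumes NumPy 1.22 or later (for keepdims in np.nanargmax), rebuilds the sample frame from the question, and the name `result` is only illustrative.

```python
import numpy as np
import pandas as pd

def argmax_n(arr, n, axis=None):
    # Float copy so NaN can mask values that were already picked
    arr = arr.astype(float)
    argmaxes = []
    for _ in range(n):
        argmax = np.nanargmax(arr, axis=axis, keepdims=True)
        argmaxes.append(argmax)
        np.put_along_axis(arr, argmax, np.nan, axis=axis)
    return argmaxes

# Sample frame from the question
df = pd.DataFrame(
    [
        [0, 0, 1, 2, 2, 0, 0, 0, 0, 0],
        [4, 4, 0, 4, 4, 1, 0, 0, 0, 0],
        [0, 0, 1, 2, 3, 0, 0, 0, 0, 0],
    ],
    columns=[f"x_{i}" for i in range(1, 11)],
)

a = df.to_numpy()
cols = df.columns.to_numpy()

max_data = {}
for i, arg in enumerate(argmax_n(a, 3, axis=1), start=1):
    max_data[f"Max{i}"] = cols[arg.ravel()]
    max_data[f"ValMax{i}"] = np.take_along_axis(a, arg, axis=1).ravel()

# Reuse the original index so the join aligns row by row
result = df.join(pd.DataFrame(max_data, index=df.index))
print(result)
```

Note that argmax_n raises a ValueError if n is larger than the number of non-NaN values in a row, since np.nanargmax cannot handle an all-NaN slice.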