I have a DF that looks like this:
Virus Host blastRank crisprRank mashRank
0 NC_000866|1 NC_017660|1 1.0 inf inf
1 NC_000871|1 NC_017595|1 1.0 inf inf
2 NC_000872|1 NC_017595|1 1.0 inf inf
3 NC_000896|1 NC_008530|1 1.0 inf inf
4 NC_000902|1 NC_011353|1 1.0 inf inf
... ... ... ... ... ...
51935 NC_024392|1 NC_021824|1 inf inf 1.0
51936 NC_024392|1 NC_021829|1 inf inf 1.0
51937 NC_024392|1 NC_021837|1 inf inf 1.0
51938 NC_024392|1 NC_021872|1 inf inf 1.0
51939 NC_024392|1 NC_022737|1 inf inf 1.0
What I would like to do is group this df by Virus and, for each group, take the rows in which each column attains its minimum (the first row is the one where blastRank is minimal, the second row is the one where crisprRank is minimal, and so on). If a minimum is tied across multiple rows, I would like to keep all of those rows. I also have to do it in a way that supports more than just those 3 columns (my program has to support an arbitrary number of numeric columns, which is why I use df[df.columns.to_list()[2:]]).
This is my code and df that it produces:
df = df.groupby(['Virus'], as_index=False).apply(lambda x: x.loc[x[x.columns.to_list()[2:]].idxmin()].reset_index(drop=True))
Virus Host blastRank crisprRank mashRank
0 0 NC_000866|1 NC_017660|1 1.0 inf inf
1 NC_000866|1 NC_017660|1 1.0 inf inf
2 NC_000866|1 NC_002163|1 inf inf 1.0
1 0 NC_000871|1 NC_017595|1 1.0 inf inf
1 NC_000871|1 NC_006449|1 inf 1.0 1.0
... ... ... ... ... ...
818 1 NC_024391|1 NC_009641|1 1.0 inf inf
2 NC_024391|1 NC_003103|1 inf inf 1.0
819 0 NC_024392|1 NC_021823|1 1.0 1.0 inf
1 NC_024392|1 NC_021823|1 1.0 1.0 inf
2 NC_024392|1 NC_003212|1 inf inf 1.0
As you can see, idxmin() returns only the first minimum value. I would like to do something like idxmin(keep='all') to get all the ties.
I think you need to test the minimal values per group to keep all ties:
cols = df.columns.to_list()[2:]
# per column, keep every value equal to that column's minimum within the group
f = lambda x: x.apply(lambda s: s[s == s.min()].reset_index(drop=True))
df = df.groupby(['Virus'])[cols].apply(f)
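On toy data (hypothetical, not the original frame), the nested apply above returns, per group, one column of tied minima per rank column, NaN-padded where one column has more ties than another. Note the Host column is not part of the result, since only the rank columns are selected:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a tie in blastRank for virus 'a'
df = pd.DataFrame({'Virus': ['a', 'a', 'a'],
                   'Host': ['h1', 'h2', 'h3'],
                   'blastRank': [1.0, 1.0, 2.0],
                   'mashRank': [np.inf, np.inf, 1.0]})
cols = df.columns.to_list()[2:]

# per column, keep every value equal to that column's minimum within the group
f = lambda x: x.apply(lambda s: s[s == s.min()].reset_index(drop=True))
out = df.groupby(['Virus'])[cols].apply(f)
print(out)  # blastRank keeps both tied 1.0 values; mashRank has one row, NaN-padded
```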
If you need all values in the original order:
cols = df.columns.to_list()[2:]
# mask non-minimal values with NaN, then drop rows that are NaN in every rank column;
# group_keys=False preserves the original index so the assignment aligns
f = lambda x: x[cols].where(x[cols].eq(x[cols].min()))
df[cols] = df.groupby('Virus', group_keys=False).apply(f)
df = df.dropna(subset=cols, how='all')
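To see what the where/dropna pair does, here is a self-contained run on hypothetical toy data, using an equivalent transform('min') mask to sidestep index-alignment concerns: non-minimal values become NaN, and only rows that are NaN in every rank column are dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the original frame
df = pd.DataFrame({'Virus': ['a', 'a', 'b'],
                   'Host': ['h1', 'h2', 'h3'],
                   'blastRank': [1.0, 2.0, 1.0],
                   'mashRank': [np.inf, 1.0, np.inf]})
cols = df.columns.to_list()[2:]

# transform('min') broadcasts each group's per-column minimum to every row;
# where() replaces any value above that minimum with NaN
mins = df.groupby('Virus')[cols].transform('min')
df[cols] = df[cols].where(df[cols].eq(mins))

# drop rows that lost every rank value
df = df.dropna(subset=cols, how='all')
print(df)
```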
Or:
df = df.melt(['Virus','Host'])
df1 = df[df.groupby(['Virus','variable'])['value'].transform('min').eq(df['value'])].copy()
df1 = df1.pivot(index=['Virus','Host'], columns='variable', values='value')
print(df1)
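The transform('min') comparison used in the melt variant also works without any reshaping: comparing every value to its group minimum keeps all ties, which is exactly the effect idxmin(keep='all') would have. A minimal sketch on hypothetical toy data with a single rank column:

```python
import pandas as pd

# Hypothetical toy data with a tied minimum in group 'a'
df = pd.DataFrame({'Virus': ['a', 'a', 'a', 'b'],
                   'blastRank': [1.0, 1.0, 2.0, 3.0]})

# transform('min') broadcasts each group's minimum back to every row,
# so the boolean mask keeps *all* rows that tie for the minimum
mask = df['blastRank'].eq(df.groupby('Virus')['blastRank'].transform('min'))
print(df[mask])  # rows 0 and 1 (tied minimum of 'a') plus row 3 (minimum of 'b')
```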
Here's one way to solve it:
import numpy as np
import pandas as pd
from io import StringIO
data = StringIO("""
Virus Host blastRank crisprRank mashRank
0 NC_000866|1 NC_017660|1 1.0 5 8
1 NC_000866|1 NC_017595|1 2.0 4 5
2 NC_000872|1 NC_017595|1 3.0 3 10
3 NC_000872|1 NC_008530|1 4.0 0 3
4 NC_000872|1 NC_011353|1 5.0 1 -3
""")
df = pd.read_csv(data, sep=r'\s+')
cols_of_interest = [c for c in df.columns if c not in ['Virus', 'Host']]
def get_all_min(sdf):
    # one-row frame holding the per-column minima of the rank columns;
    # restricting to the numeric columns keeps the merge keys numeric
    sdf_min = sdf[cols_of_interest].min().to_frame().T
    # inner merge on each minimum column keeps every row that ties it
    result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner')
                        for c in sdf_min.columns])
    result = result.drop_duplicates().reset_index(drop=True)
    return result
df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)
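A quick sanity check of this merge-on-minimum approach on hypothetical toy data with a tie, redefining the function here so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical toy group with a tie on blastRank (not the original data)
df = pd.DataFrame({'Virus': ['v1', 'v1', 'v1'],
                   'Host': ['h1', 'h2', 'h3'],
                   'blastRank': [1.0, 1.0, 2.0],
                   'crisprRank': [5.0, 4.0, 3.0]})
cols_of_interest = ['blastRank', 'crisprRank']

def get_all_min(sdf):
    # one-row frame of per-column minima, restricted to the rank columns
    sdf_min = sdf[cols_of_interest].min().to_frame().T
    # inner merge on each minimum column keeps every row that ties it
    result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner')
                        for c in sdf_min.columns])
    return result.drop_duplicates().reset_index(drop=True)

out = df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)
print(out)  # both blastRank ties (h1, h2) survive, plus the crisprRank minimum (h3)
```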