I have a DF that looks like this:
Virus Host blastRank crisprRank mashRank
0 NC_000866|1 NC_017660|1 1.0 inf inf
1 NC_000871|1 NC_017595|1 1.0 inf inf
2 NC_000872|1 NC_017595|1 1.0 inf inf
3 NC_000896|1 NC_008530|1 1.0 inf inf
4 NC_000902|1 NC_011353|1 1.0 inf inf
... ... ... ... ... ...
51935 NC_024392|1 NC_021824|1 inf inf 1.0
51936 NC_024392|1 NC_021829|1 inf inf 1.0
51937 NC_024392|1 NC_021837|1 inf inf 1.0
51938 NC_024392|1 NC_021872|1 inf inf 1.0
51939 NC_024392|1 NC_022737|1 inf inf 1.0
What I would like to do is group this df by Virus and, for each group, take the rows in which each column attains its minimum (the first row is the one where blastRank is minimal, the second row is the one where crisprRank is minimal, and so on). If a minimum is tied across multiple rows, I would like to keep all of those rows. I also have to do it in a way that supports more than just those 3 columns (my program has to support an arbitrary number of numeric columns, which is why I use df[df.columns.to_list()[2:]]).
This is my code and df that it produces:
df = df.groupby(['Virus'], as_index=False).apply(lambda x: x.loc[x[x.columns.to_list()[2:]].idxmin()].reset_index(drop=True))
Virus Host blastRank crisprRank mashRank
0 0 NC_000866|1 NC_017660|1 1.0 inf inf
1 NC_000866|1 NC_017660|1 1.0 inf inf
2 NC_000866|1 NC_002163|1 inf inf 1.0
1 0 NC_000871|1 NC_017595|1 1.0 inf inf
1 NC_000871|1 NC_006449|1 inf 1.0 1.0
... ... ... ... ... ...
818 1 NC_024391|1 NC_009641|1 1.0 inf inf
2 NC_024391|1 NC_003103|1 inf inf 1.0
819 0 NC_024392|1 NC_021823|1 1.0 1.0 inf
1 NC_024392|1 NC_021823|1 1.0 1.0 inf
2 NC_024392|1 NC_003212|1 inf inf 1.0
As you can see, idxmin() returns only the first minimum value. I would like to do something like idxmin(keep='all') to get all the ties.
I think you need to test the minimal values per group to keep all ties:
cols = df.columns.to_list()[2:]
# per column, keep every value equal to that column's minimum within the group
f = lambda x: x.apply(lambda s: s[s == s.min()].reset_index(drop=True))
df = df.groupby(['Virus'])[cols].apply(f)
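On toy data (hypothetical, not the original frame), the nested apply above returns, per group, one column of tied minima per rank column, NaN-padded where one column has more ties than another. Note the Host column is not part of the result, since only the rank columns are selected:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a tie in blastRank for virus 'a'
df = pd.DataFrame({'Virus': ['a', 'a', 'a'],
                   'Host': ['h1', 'h2', 'h3'],
                   'blastRank': [1.0, 1.0, 2.0],
                   'mashRank': [np.inf, np.inf, 1.0]})
cols = df.columns.to_list()[2:]

# per column, keep every value equal to that column's minimum within the group
f = lambda x: x.apply(lambda s: s[s == s.min()].reset_index(drop=True))
out = df.groupby(['Virus'])[cols].apply(f)
print(out)  # blastRank keeps both tied 1.0 values; mashRank has one row, NaN-padded
```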
If you need all values in the original order:
cols = df.columns.to_list()[2:]
# mask non-minimal values with NaN, then drop rows that are NaN in every rank column;
# group_keys=False preserves the original index so the assignment aligns
f = lambda x: x[cols].where(x[cols].eq(x[cols].min()))
df[cols] = df.groupby('Virus', group_keys=False).apply(f)
df = df.dropna(subset=cols, how='all')
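To see what the where/dropna pair does, here is a self-contained run on hypothetical toy data, using an equivalent transform('min') mask to sidestep index-alignment concerns: non-minimal values become NaN, and only rows that are NaN in every rank column are dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the original frame
df = pd.DataFrame({'Virus': ['a', 'a', 'b'],
                   'Host': ['h1', 'h2', 'h3'],
                   'blastRank': [1.0, 2.0, 1.0],
                   'mashRank': [np.inf, 1.0, np.inf]})
cols = df.columns.to_list()[2:]

# transform('min') broadcasts each group's per-column minimum to every row;
# where() replaces any value above that minimum with NaN
mins = df.groupby('Virus')[cols].transform('min')
df[cols] = df[cols].where(df[cols].eq(mins))

# drop rows that lost every rank value
df = df.dropna(subset=cols, how='all')
print(df)
```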
Or:
df = df.melt(['Virus','Host'])
df1 = df[df.groupby(['Virus','variable'])['value'].transform('min').eq(df['value'])].copy()
df1 = df1.pivot(index=['Virus','Host'], columns='variable', values='value')
print(df1)
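The transform('min') comparison used in the melt variant also works without any reshaping: comparing every value to its group minimum keeps all ties, which is exactly the effect idxmin(keep='all') would have. A minimal sketch on hypothetical toy data with a single rank column:

```python
import pandas as pd

# Hypothetical toy data with a tied minimum in group 'a'
df = pd.DataFrame({'Virus': ['a', 'a', 'a', 'b'],
                   'blastRank': [1.0, 1.0, 2.0, 3.0]})

# transform('min') broadcasts each group's minimum back to every row,
# so the boolean mask keeps *all* rows that tie for the minimum
mask = df['blastRank'].eq(df.groupby('Virus')['blastRank'].transform('min'))
print(df[mask])  # rows 0 and 1 (tied minimum of 'a') plus row 3 (minimum of 'b')
```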
Here's one way to solve it:
import numpy as np
import pandas as pd
from io import StringIO
data = StringIO("""
Virus Host blastRank crisprRank mashRank
0 NC_000866|1 NC_017660|1 1.0 5 8
1 NC_000866|1 NC_017595|1 2.0 4 5
2 NC_000872|1 NC_017595|1 3.0 3 10
3 NC_000872|1 NC_008530|1 4.0 0 3
4 NC_000872|1 NC_011353|1 5.0 1 -3
""")
df = pd.read_csv(data, sep=r'\s+')
cols_of_interest = [c for c in df.columns if c not in ['Virus', 'Host']]
def get_all_min(sdf):
    # one-row frame holding the per-column minima of the rank columns;
    # restricting to the numeric columns keeps the merge keys numeric
    sdf_min = sdf[cols_of_interest].min().to_frame().T
    # inner merge on each minimum column keeps every row that ties it
    result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner')
                        for c in sdf_min.columns])
    result = result.drop_duplicates().reset_index(drop=True)
    return result
df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)
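A quick sanity check of this merge-on-minimum approach on hypothetical toy data with a tie, redefining the function here so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical toy group with a tie on blastRank (not the original data)
df = pd.DataFrame({'Virus': ['v1', 'v1', 'v1'],
                   'Host': ['h1', 'h2', 'h3'],
                   'blastRank': [1.0, 1.0, 2.0],
                   'crisprRank': [5.0, 4.0, 3.0]})
cols_of_interest = ['blastRank', 'crisprRank']

def get_all_min(sdf):
    # one-row frame of per-column minima, restricted to the rank columns
    sdf_min = sdf[cols_of_interest].min().to_frame().T
    # inner merge on each minimum column keeps every row that ties it
    result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner')
                        for c in sdf_min.columns])
    return result.drop_duplicates().reset_index(drop=True)

out = df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)
print(out)  # both blastRank ties (h1, h2) survive, plus the crisprRank minimum (h3)
```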