i've got a pd.DataFrame with four columns
df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2]
, 'A':['H','H','E','E','H','E','E','H','H']
, 'B':[4,5,2,7,6,1,3,1,0]
, 'C':['M','D','M','D','M','M','M','D','D']})
id A B C
0 1 H 4 M
1 1 H 5 D
2 1 E 2 M
3 1 E 7 D
4 1 H 6 M
5 2 E 1 M
6 2 E 3 M
7 2 H 1 D
8 2 H 0 D
I'd like to group by id and get the value of B for the nth (let's say second) occurrence of A = 'H' for each id in agg_B1 and value of B for the nth (let's say first) occurrence of C='M':
desired output:
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
desired_output = df.groupby('id').agg(
agg_B1= ('B',lambda x:x[df.loc[x.index].loc[df.A== 'H'][1]])
, agg_B2= ('B',lambda x:x[df.loc[x.index].loc[df.C== 'M'][0]])
).reset_index()
TypeError: Indexing a Series with DataFrame is not supported, use the appropriate DataFrame column
Obviously, I'm doing something wrong with the indexing.
Edit: if possible, I'd like to use aggregate with lambda function, because there are multiple aggregate outputs of other sorts that I'd like to extract at the same time.
Your solution is possible change if need GroupBy.agg:
desired_output = df.groupby('id').agg(
agg_B1= ('B',lambda x:x[df.loc[x.index, 'A']== 'H'].iat[1]),
agg_B2= ('B',lambda x:x[df.loc[x.index, 'C']== 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
But if performance is important and also not sure if always exist second value matched H for first condition I suggest processing each condition separately and last add to original aggregated values:
#some sample aggregations
df0 = df.groupby('id').agg({'B':'sum', 'C':'last'})
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(1).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df0, df1, df2], axis=1)
print (desired_output)
B C agg_B1 agg_B2
id
1 24 M 5 4
2 5 D 0 1
EDIT1: If need GroupBy.agg is possible test if failed indexing and then add missing value:
#for second value in sample working nice
def f1(x):
try:
return x[df.loc[x.index, 'A']== 'H'].iat[1]
except:
return np.nan
desired_output = df.groupby('id').agg(
agg_B1= ('B',f1),
agg_B2= ('B',lambda x:x[df.loc[x.index, 'C']== 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
#third value not exist so added missing value NaN
def f1(x):
try:
return x[df.loc[x.index, 'A']== 'H'].iat[2]
except:
return np.nan
desired_output = df.groupby('id').agg(
agg_B1= ('B',f1),
agg_B2= ('B',lambda x:x[df.loc[x.index, 'C']== 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 6.0 4
1 2 NaN 1
What working same like:
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(2).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df1, df2], axis=1)
print (desired_output)
agg_B1 agg_B2
id
1 6.0 4
2 NaN 1
Filter for rows where A equals H, then grab the second row with the nth function :
df.query("A=='H'").groupby("id").nth(1)
A B
id
1 H 5
2 H 0
Python works on a zero based notation, so row 2 will be nth(1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With