
Pandas DataFrame, smart apply of a complex function to groupby result

I have a pandas.DataFrame with 3 columns of type str and n other columns of type float64.

I need to group rows by one of the three str columns and apply a function myComplexFunc() which will reduce N rows to one row.

myComplexFunc() accepts only the float64 columns.

This can be done with some for loops, but that would not be efficient. So I tried to use pandas' flexible apply, but it seems to run the heavy code of myComplexFunc() twice!

To be clearer, here is a minimal example.

Let df be a DataFrame like this:

df
>>
     A      B         C         D
0  foo    one  0.406157  0.735223
1  bar    one  1.020493 -1.167256
2  foo    two -0.314192 -0.883087
3  bar  three  0.271705 -0.215049
4  foo    two  0.535290  0.185872
5  bar    two  0.178926 -0.459890
6  foo    one -1.939673 -0.523396
7  foo  three -2.125591 -0.689809

myComplexFunc()

def myComplexFunc(rows):
  # Some transformations that will return 1 row
  result = some_transformations(rows)
  return result

What I want:

# wanted_apply stands for the method I am looking for
df.groupby("A").wanted_apply(myComplexFunc)

>> 
    A    C            D
0  foo   new_c0_foo   new_d0_foo
1  bar   new_c0_bar   new_d0_bar

Column B has been removed because it is not of type float64.

Thanks in advance

asked Dec 06 '25 21:12 by farhawa


1 Answer

You can filter the DataFrame by dtype with select_dtypes, but then you need to group by the Series df.A instead of the column name:

import numpy as np

def myComplexFunc(rows):
    return rows + 10

# keep only the float64 columns, then group by the original Series df.A
df = df.select_dtypes(include=[np.float64]).groupby([df.A]).apply(myComplexFunc)
print(df)
           C          D
0  10.406157  10.735223
1  11.020493   8.832744
2   9.685808   9.116913
3  10.271705   9.784951
4  10.535290  10.185872
5  10.178926   9.540110
6   8.060327   9.476604
7   7.874409   9.310191

because if you group by the column name 'A' only:

df = df.select_dtypes(include=[np.float64]).groupby('A').apply(myComplexFunc)

you get

KeyError: 'A'

and that is correct: all string columns (A and B) have been excluded.

print (df.select_dtypes(include=[np.float64]))
          C         D
0  0.406157  0.735223
1  1.020493 -1.167256
2 -0.314192 -0.883087
3  0.271705 -0.215049
4  0.535290  0.185872
5  0.178926 -0.459890
6 -1.939673 -0.523396
7 -2.125591 -0.689809
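
The placeholder function above returns one row per input row, whereas the question asks for one row per group. Here is a minimal sketch of that case. It assumes the real myComplexFunc() can return a Series indexed by column name. reduce_rows below is a hypothetical stand-in (max minus min per column), not the asker's actual transformation, and the frame is rebuilt from the sample data so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [0.406157, 1.020493, -0.314192, 0.271705,
          0.535290, 0.178926, -1.939673, -2.125591],
    "D": [0.735223, -1.167256, -0.883087, -0.215049,
          0.185872, -0.459890, -0.523396, -0.689809],
})

def reduce_rows(rows):
    # Hypothetical stand-in for myComplexFunc(): collapse all rows of a
    # group into a single row (a Series with one value per column).
    return rows.max() - rows.min()

# select_dtypes drops A and B, so the group key is passed as the Series df["A"];
# pandas aligns it with the float-only frame through the shared index.
floats = df.select_dtypes(include=[np.float64])
result = floats.groupby(df["A"]).apply(reduce_rows)

# The group labels end up in the index; reset_index turns them back into column A.
result = result.reset_index()
print(result)
```

The result has one row per value of A and only the columns A, C and D. If the reduction can be expressed with built-in aggregations, agg() on the same groupby is usually a better fit than apply().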
answered Dec 08 '25 09:12 by jezrael


