Suppose I have a data frame with columns x, a, b, and c. I would like to group by a, b, c, compute a value y from each group's x values via a function myfun, and then assign that value to every row within the group (window/partition).
In R with data.table this is just one line: dt[, y := myfun(x), by = list(a, b, c)].
In Python, the only way I can think of is something like this:
import itertools
from operator import attrgetter

# To simulate rows in a data frame
class Record:
    def __init__(self, x, a, b, c):
        self.x = x
        self.a = a
        self.b = b
        self.c = c

# Assume df is a list of Record objects
mykey = attrgetter('a', 'b', 'c')
# groupby only merges adjacent items, so sort by the same key first
for key, group_iter in itertools.groupby(sorted(df, key=mykey), key=mykey):
    group = list(group_iter)
    y = myfun(rec.x for rec in group)
    for rec in group:
        rec.y = y
Although the logic is quite clear, I am not 100% happy with it. Is there a better approach?
I am not very familiar with pandas. Would it help in a case like this?
Side question: is there a category that my problem belongs to? Aggregation? Partition? Window? This pattern comes up so often in data analysis that there must be an existing name for it.
Use a DataFrame and its groupby method from pandas:
import pandas as pd
df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})
df.groupby('a').apply(myfun)
The exact usage depends on how you wrote your function myfun. Where the column used is static (e.g. always x), I write myfun to take the full DataFrame of each group and subset it inside the function. However, if your function is written to accept a vector (or a pandas Series), you can also select the column first and apply your function to it:
df.groupby('a')['x'].apply(myfun)
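Note that when myfun returns a single value, apply gives back one value per group, not one per row. To get the data.table `:=` behaviour from the question (the group value repeated on every row), transform is probably the closer match. A minimal sketch, assuming myfun reduces a Series to a scalar; the myfun body and the sample data below are made up for illustration:

```python
import pandas as pd

def myfun(values):
    # Hypothetical stand-in for the question's myfun: any function that
    # reduces a Series of x values to a single scalar would do here.
    return values.sum()

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'a': ['p', 'p', 'q', 'q', 'p'],
    'b': [1, 1, 2, 2, 1],
    'c': ['u', 'u', 'v', 'v', 'u'],
})

# transform calls myfun once per (a, b, c) group and repeats the scalar
# result on every row of that group, keeping the original row order.
df['y'] = df.groupby(['a', 'b', 'c'])['x'].transform(myfun)
```

Here df keeps all five rows, and the new y column holds the same value for every row that shares the same (a, b, c) key.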
FWIW, it is also often convenient to return a pd.Series object when you're using groupby.
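As a rough illustration of returning a pd.Series: each label in the returned Series becomes a column in the combined result. The summarize function and its labels (total, n) are invented for this sketch and are not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

def summarize(group):
    # Hypothetical per-group function: receives the sub-DataFrame for one
    # group and returns several named statistics at once.
    return pd.Series({'total': group['x'].sum(), 'n': len(group)})

# Selecting [['x']] first keeps the grouping column out of the frames
# passed to summarize; the result has one row per group and one column
# per Series label.
result = df.groupby('a')[['x']].apply(summarize)
```

The result is indexed by the values of a, with total and n as columns.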
To answer your side question, this is known as the split-apply-combine strategy of data processing. See here for more info.
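To make the three steps concrete, here is roughly what groupby does for you, written out by hand. This is only a sketch of the idea, not how pandas implements it, and max is an arbitrary stand-in for myfun:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

pieces = []
# Split: iterating over a groupby yields (key, sub-frame) pairs.
for key, group in df.groupby('a'):
    group = group.copy()
    # Apply: reduce the group's x values to one number and store it
    # on every row of the group (max is only a stand-in for myfun).
    group['y'] = group['x'].max()
    pieces.append(group)

# Combine: stitch the pieces together and restore the original row order.
result = pd.concat(pieces).sort_index()
```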