
Pandas custom aggregate function with condition on group, is it possible?

I have the following dataframe:

import pandas as pd

df = pd.DataFrame(
    [{'price': 22, 'weight': 1, 'product': 'banana'},
     {'price': 20, 'weight': 2, 'product': 'apple'},
     {'price': 18, 'weight': 2, 'product': 'car'},
     {'price': 100, 'weight': 1, 'product': 'toy'},
     {'price': 27, 'weight': 1, 'product': 'computer'},
     {'price': 200, 'weight': 1, 'product': 'book'},
     {'price': 200.5, 'weight': 3, 'product': 'mouse'},
     {'price': 202, 'weight': 3, 'product': 'door'}]
)

What I have to do is group contiguous prices, i.e. prices whose difference from their neighbour is below a threshold (say 2.0). After that I have to apply the following aggregations ONLY to the 'below threshold' groups; the other groups should not be aggregated (a sketch of the desired result follows the list):

  1. price should become the weighted average of price, using weight as the weights
  2. weight should be the maximum value
  3. product should be the string concatenation of the products
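
For reference, given the threshold of 2.0, the desired result would look like this (row order and exact index layout aside):

        price  weight           product
    19.600000       2  car apple banana
    27.000000       1          computer
   100.000000       1               toy
   201.071429       3   book mouse door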

What I did so far (step by step):

  1. I sorted the dataframe by price in ascending order (to make contiguous prices adjacent)
df.sort_values(by=['price'], inplace=True)
    price  weight   product
2   18.0       2       car
1   20.0       2     apple
0   22.0       1    banana
4   27.0       1  computer
3  100.0       1       toy
5  200.0       1      book
6  200.5       3     mouse
7  202.0       3      door    
  2. Computed the difference between prices in ascending and descending order to detect the contiguous prices
df['asc_diff'] = df['price'].diff(periods=1)          # distance to the previous row
df['desc_diff'] = df['price'].diff(periods=-1).abs()  # distance to the next row
    price  weight   product  asc_diff  desc_diff
2   18.0       2       car       NaN        2.0
1   20.0       2     apple       2.0        2.0
0   22.0       1    banana       2.0        5.0
4   27.0       1  computer       5.0       73.0
3  100.0       1       toy      73.0      100.0
5  200.0       1      book     100.0        0.5
6  200.5       3     mouse       0.5        1.5
7  202.0       3      door       1.5        NaN
  3. Combined the asc_diff and desc_diff columns to remove the NaN endpoints and mark the contiguous regions
df['asc_diff'] = df['asc_diff'].combine_first(df['desc_diff'])   # fill the NaN endpoints
df['asc_diff'] = df[['asc_diff', 'desc_diff']].min(axis=1).abs() # smaller distance to a neighbour
df['asc_diff'] = df['asc_diff'] <= 2.0                           # True where a neighbour is within the threshold
df = df.drop(columns=['desc_diff'])
    price  weight   product  asc_diff
2   18.0       2       car      True
1   20.0       2     apple      True
0   22.0       1    banana      True
4   27.0       1  computer     False
3  100.0       1       toy     False
5  200.0       1      book      True
6  200.5       3     mouse      True
7  202.0       3      door      True
  4. Created the groups
# a new group starts wherever asc_diff changes between consecutive rows
g = df.groupby((df['asc_diff'].shift() != df['asc_diff']).cumsum())
for k, v in g:
    print(f'[group {k}]')
    print(v)
[group 1]
   price  weight product  asc_diff
2   18.0       2     car      True
1   20.0       2   apple      True
0   22.0       1  banana      True
[group 2]
   price  weight   product  asc_diff
4   27.0       1  computer     False
3  100.0       1       toy     False
[group 3]
   price  weight product  asc_diff
5  200.0       1    book      True
6  200.5       3   mouse      True
7  202.0       3    door      True
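
(As an aside, the shift/cumsum idiom above is the standard way to label consecutive runs of equal values; a minimal illustration:)

s = pd.Series([True, True, False, True])
print((s.shift() != s).cumsum().tolist())  # [1, 1, 2, 3]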

So far so good, but the problems appeared when I had to aggregate:

def product_join(x):
    return ' '.join(x)
g.agg({'weight': 'max', 'product': product_join})
           weight           product
asc_diff                          
1              2  car apple banana
2              1      computer toy
3              3   book mouse door

The problems:

  • only groups 1 and 3 should be aggregated, but the code applies the aggregation to all groups
  • even with a custom function (e.g. product_join) I have no access to the other columns' values, so I cannot compute the weighted average price, for example (see the snippet after this list)
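
A quick way to see the second limitation: the function passed to agg receives each column on its own, as a plain Series, with no handle on its sibling columns (probe is a hypothetical name, not part of the solution):

def probe(x):
    print(type(x).__name__, list(x))  # prints, per group, something like: Series [18.0, 20.0, 22.0]
    return x.iloc[0]

g.agg({'price': probe})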

What I want to accomplish:

  • aggregate only groups 1 and 3 (where asc_diff is True) and keep group 2 intact
  • for the price aggregation, I need a function that can access two columns (i.e. price and weight) to compute the weighted average

Thanks in advance!

asked Oct 27 '25 by Eduardo Gomes


2 Answers

This builds on @Panwen Wang's solution, sticking with Pandas.

Get the contiguous rows via cumsum and diff:

temp = (df
        .sort_values('price')
        # a new group starts wherever the gap to the previous price exceeds 2
        .assign(group = lambda df: df.price.diff().gt(2).cumsum())
       )

temp

   price  weight   product  group
2   18.0       2       car      0
1   20.0       2     apple      0
0   22.0       1    banana      0
4   27.0       1  computer      1
3  100.0       1       toy      2
5  200.0       1      book      3
6  200.5       3     mouse      3
7  202.0       3      door      3

Create a custom function to get the weighted mean (you could alternatively use np.average; I'm only trying to avoid the apply function):

def weighted_mean(df, column_to_average, weights, by):
    df = df.set_index(by)
    # group-wise sums over the index level; each is a Series indexed by `by`
    numerator = df[column_to_average].mul(df[weights]).groupby(level=by).sum()
    denominator = df[weights].groupby(level=by).sum()
    return numerator / denominator
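
For comparison, a minimal sketch of the np.average route mentioned above (one apply call per group, which is typically slower on large frames):

import numpy as np

# weighted mean per group, returned as a Series indexed by group
wavg = temp.groupby('group').apply(
    lambda g: np.average(g['price'], weights=g['weight'])
)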

Compute the results:

(temp
 # map each row's group id to that group's weighted mean price
 .assign(price = lambda df: df.group.map(weighted_mean(df, 'price', 'weight', 'group')))
 .groupby('group')
 .agg(price=('price', 'first'),   # every row in a group carries the same mapped price
      weight=('weight', 'max'),
      product=('product', ' '.join))
)
 
            price  weight           product
group                                      
0       19.600000       2  car apple banana
1       27.000000       1          computer
2      100.000000       1               toy
3      201.071429       3   book mouse door
answered Oct 29 '25 by sammywemmy


If I'm getting it right, you want to aggregate only the groups where all of the values in the asc_diff column are True. The other groups (asc_diff == False) should not be changed.

If that's the case, starting from what you've done so far, the solution is straightforward. You only need a custom apply function that does the work based on the conditions you define. It would look like this:

import numpy as np

def custom_apply(df):
    # leave the 'False' groups untouched, only matching the output shape
    if not df['asc_diff'].all():
        df = df.set_index('asc_diff')
        return df[['price', 'weight', 'product']]

    # weighted average of price, weighted by the group's weight column
    def wavg(x):
        return np.average(x, weights=df.loc[x.index, "weight"])

    df1 = df.groupby('asc_diff').agg({'price': wavg, 'weight': 'max'})
    df2 = df.groupby('asc_diff').agg({'product': ' '.join})
    return pd.concat([df1, df2], axis=1)

The main tweaks of this function are the following:

  1. Check the values of the asc_diff column: if all of them are False, just return the dataframe with the columns you want.
  2. Use a custom function (wavg) to calculate the weighted price.
  3. Calculate the aggregations and concatenate them.

Then you just need to apply this function to your grouped dataframe:

print(g.apply(custom_apply).droplevel(1))

The result will be:

               price  weight           product
asc_diff                                      
1          19.600000       2  car apple banana
2          27.000000       1          computer
2         100.000000       1               toy
3         201.071429       3   book mouse door
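
Since each group produced by the cumsum grouping is homogeneous in asc_diff, the inner groupby calls can also be skipped; a minimal equivalent sketch (custom_apply_flat is a hypothetical name, used the same way via g.apply(...).droplevel(1)):

def custom_apply_flat(df):
    # 'False' groups pass through untouched
    if not df['asc_diff'].all():
        return df.set_index('asc_diff')[['price', 'weight', 'product']]
    # 'True' groups collapse to a single aggregated row
    return pd.DataFrame(
        {'price': np.average(df['price'], weights=df['weight']),
         'weight': df['weight'].max(),
         'product': ' '.join(df['product'])},
        index=pd.Index([True], name='asc_diff'),
    )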
answered Oct 29 '25 by Ricardo Erikson


