I have the following dataframe:
df = pd.DataFrame(
[{'price': 22, 'weight': 1, 'product': 'banana', },
{'price': 20, 'weight': 2, 'product': 'apple', },
{'price': 18, 'weight': 2, 'product': 'car', },
{'price': 100, 'weight': 1, 'product': 'toy', },
{'price': 27, 'weight': 1, 'product': 'computer', },
{'price': 200, 'weight': 1, 'product': 'book', },
{'price': 200.5, 'weight': 3, 'product': 'mouse', },
{'price': 202, 'weight': 3, 'product': 'door', },]
)
What I have to do is group contiguous prices where the difference between them is less than a threshold (say 2.0). After that I have to apply the following aggregations ONLY to the groups below the threshold; the other groups should not be aggregated:

- price should be the weighted average of price, using weight as the weights
- weight should be the maximum value
- product should be the string concatenation

What I did so far (step by step):
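To make the price rule concrete, `np.average` computes exactly this kind of weighted mean. A quick hand check on the three lowest prices (a hypothetical sanity check, not part of the original question):

```python
import numpy as np

# Weighted average of prices 18, 20, 22 with weights 2, 2, 1:
# (18*2 + 20*2 + 22*1) / (2 + 2 + 1) = 98 / 5
wavg = np.average([18, 20, 22], weights=[2, 2, 1])
print(wavg)  # 19.6
```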
df.sort_values(by=['price'], inplace=True)
price weight product
2 18.0 2 car
1 20.0 2 apple
0 22.0 1 banana
4 27.0 1 computer
3 100.0 1 toy
5 200.0 1 book
6 200.5 3 mouse
7 202.0 3 door
df['asc_diff'] = df['price'].diff(periods=1)
df['desc_diff'] = df['price'].diff(periods=-1).abs()
price weight product asc_diff desc_diff
2 18.0 2 car NaN 2.0
1 20.0 2 apple 2.0 2.0
0 22.0 1 banana 2.0 5.0
4 27.0 1 computer 5.0 73.0
3 100.0 1 toy 73.0 100.0
5 200.0 1 book 100.0 0.5
6 200.5 3 mouse 0.5 1.5
7 202.0 3 door 1.5 NaN
Then I combine the asc_diff and desc_diff columns to remove the NaN values and create the contiguous regions:
df['asc_diff'] = df['asc_diff'].combine_first(df['desc_diff'])
df['asc_diff'] = df[['asc_diff', 'desc_diff']].min(axis=1).abs()
df['asc_diff'] = df['asc_diff'] <= 2.0
df = df.drop(columns=['desc_diff'])
price weight product asc_diff
2 18.0 2 car True
1 20.0 2 apple True
0 22.0 1 banana True
4 27.0 1 computer False
3 100.0 1 toy False
5 200.0 1 book True
6 200.5 3 mouse True
7 202.0 3 door True
g = df.groupby((df['asc_diff'].shift() != df['asc_diff']).cumsum())
for k, v in g:
print(f'[group {k}]')
print(v)
[group 1]
price weight product asc_diff
2 18.0 2 car True
1 20.0 2 apple True
0 22.0 1 banana True
[group 2]
price weight product asc_diff
4 27.0 1 computer False
3 100.0 1 toy False
[group 3]
price weight product asc_diff
5 200.0 1 book True
6 200.5 3 mouse True
7 202.0 3 door True
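The grouping line above relies on the classic shift-compare-cumsum idiom for labeling consecutive runs. A minimal standalone illustration, using a hypothetical boolean series that mirrors the asc_diff column:

```python
import pandas as pd

flags = pd.Series([True, True, True, False, False, True, True, True])
# A new run starts wherever the flag changes value:
# comparing with shift() marks the change points (the first element
# always counts as a change), and cumsum() turns them into run labels.
runs = (flags.shift() != flags).cumsum()
print(runs.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3]
```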
So far so good, but the problems come when I have to aggregate:
def product_join(x):
return ' '.join(x)
g.agg({'weight': 'max', 'product': product_join})
weight product
asc_diff
1 2 car apple banana
2 1 computer toy
3 3 book mouse door
The problems / what I want to accomplish:

- Groups 1 and 3 should be aggregated (asc_diff is True) while group 2 is kept intact.
- For the price aggregate function I need a function that can access two columns (i.e. price and weight) to compute the weighted average.

Thanks in advance!
This builds off @Panwen Wang's solution, sticking with Pandas.
Get the contiguous groups via diff and cumsum:
temp = (df
.sort_values('price')
.assign(group = lambda df: df.price.diff().gt(2).cumsum())
)
temp
price weight product group
2 18.0 2 car 0
1 20.0 2 apple 0
0 22.0 1 banana 0
4 27.0 1 computer 1
3 100.0 1 toy 2
5 200.0 1 book 3
6 200.5 3 mouse 3
7 202.0 3 door 3
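The `diff().gt(2).cumsum()` idiom works because a gap larger than the threshold is True exactly at the first row of each new group, and the cumulative sum turns those break points into group labels. Isolated on just the sorted prices:

```python
import pandas as pd

s = pd.Series([18, 20, 22, 27, 100, 200, 200.5, 202])
# diff()  -> [NaN, 2, 2, 5, 73, 100, 0.5, 1.5]
# gt(2)   -> True only where a new group starts (NaN compares False)
# cumsum() -> running count of breaks = the group label
labels = s.diff().gt(2).cumsum()
print(labels.tolist())  # [0, 0, 0, 1, 2, 3, 3, 3]
```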
Create a custom function to get the weighted mean (you can alternatively use np.average, I'm only trying to avoid the apply function):
def weighted_mean(df, column_to_average, weights, by):
    # Weighted mean of `column_to_average` per group in `by`,
    # returned as a Series indexed by the group labels
    numerator = df[column_to_average].mul(df[weights]).groupby(df[by]).sum()
    denominator = df[weights].groupby(df[by]).sum()
    return numerator / denominator
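A quick standalone check of what this helper returns; the helper is repeated here (in its groupby form, since `Series.sum(level=...)` is gone from modern pandas) and the group labels are hard-coded to match the table above, so the snippet runs on its own:

```python
import pandas as pd

def weighted_mean(df, column_to_average, weights, by):
    # Weighted mean per group, returned as a Series indexed by `by`
    numerator = df[column_to_average].mul(df[weights]).groupby(df[by]).sum()
    denominator = df[weights].groupby(df[by]).sum()
    return numerator / denominator

temp = pd.DataFrame({
    'price':  [18, 20, 22, 27, 100, 200, 200.5, 202],
    'weight': [2, 2, 1, 1, 1, 1, 3, 3],
    'group':  [0, 0, 0, 1, 2, 3, 3, 3],
})
result = weighted_mean(temp, 'price', 'weight', 'group')
print(result)
# group 0 -> 98 / 5 = 19.6
# group 3 -> 1407.5 / 7 ~ 201.0714
```

Because the result is a Series indexed by group label, it can be fed to `Series.map` as a lookup table, which is what the next step does.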
Compute the results:
(temp
 .assign(price = lambda df: df.group.map(weighted_mean(df, 'price', 'weight', 'group')))
.groupby('group')
.agg(price=('price','first'),
weight=('weight','max'),
product=('product', ' '.join))
)
price weight product
group
0 19.600000 2 car apple banana
1 27.000000 1 computer
2 100.000000 1 toy
3 201.071429 3 book mouse door
If I'm getting it right, you want to aggregate only the groups where all of the values in the asc_diff column are True. The other groups (where asc_diff == False) should not be changed.
If that's the case, starting off from what you've done so far, the solution is straightforward. You only need to create a custom apply function that will do the work for you based on the conditions you define. The custom apply function would be like this:
import numpy as np

def custom_apply(df):
    # Groups where asc_diff is not all True are returned unchanged
    if not df['asc_diff'].all():
        df = df.set_index('asc_diff')
        return df[['price', 'weight', 'product']]
    # Weighted average of price, using the group's weight column
    def wavg(x):
        return np.average(x, weights=df.loc[x.index, 'weight'])
    df1 = df.groupby('asc_diff').agg({'price': wavg, 'weight': 'max'})
    df2 = df.groupby('asc_diff').agg({'product': ' '.join})
    return pd.concat([df1, df2], axis=1)
The main tweaks of this function are the following:
The main tweaks of this function are the following:

- First check all the values of asc_diff: if they are not all True, just return the dataframe with the columns you want, unchanged.
- For the weighted average of price, define a small function (wavg) that can read the group's weight column alongside the price values.

Then, you just need to apply this function to your grouped dataframe like this:
print(g.apply(custom_apply).droplevel(1))
The result will be:
price weight product
asc_diff
1 19.600000 2 car apple banana
2 27.000000 1 computer
2 100.000000 1 toy
3 201.071429 3 book mouse door