I have a pandas DataFrame that has users with features (calculated from TensorFlow word embeddings). I want to be able to group by user and calculate either a mean or median of the vectorized features:
embeddings
user features
bob [-0.030460168, -0.0014596573, 0.0997446, -0.18...
bob [-0.03197706, 0.015620711, 0.05890667, -0.0402...
bob [-0.060918115, 0.07939958, 0.0333591, 0.035655...
mary [-0.012854534, 0.07733478, 0.12939823, 0.00992...
mary [-0.04184026, 0.03382166, 0.1427004, -0.204424...
I tried something like this:
df.groupby('user').agg(count=('user', lambda x: len(x)),
                       mean=('features', lambda x: np.mean(x)))
But it raises the following error:
Exception: Must produce aggregated value
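The problem is that x is a pd.Series of numpy arrays, so np.mean(x) returns an array rather than a single aggregated value. Assuming you want the centroid, you could use np.vstack to stack the arrays into one 2-D array, take the mean across the first axis, and convert the result to a list: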
Setup
import numpy as np
import pandas as pd
arrays = [np.array([-0.030460168, -0.0014596573, 0.0997446, -0.18]),
          np.array([-0.03197706, 0.015620711, 0.05890667, -0.0402]),
          np.array([-0.060918115, 0.07939958, 0.0333591, 0.035655]),
          np.array([-0.012854534, 0.07733478, 0.12939823, 0.00992]),
          np.array([-0.04184026, 0.03382166, 0.1427004, -0.204424])]
users = ['bob', 'bob', 'bob', 'mary', 'mary']
df = pd.DataFrame(data={'user': users, 'features': arrays})
Code
result = df.groupby('user').agg(count=('user', lambda x: len(x)),
                                mean=('features', lambda x: np.vstack(x).mean(axis=0).tolist()))
print(result)
Output
count mean
user
bob 3 [-0.04111844766666667, 0.031186877899999996, 0...
mary 2 [-0.027347397, 0.055578220000000005, 0.1360493...
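Median
The question also asks about the median. The sketch below is one possible way to get it, reusing the same toy data as the Setup above. It loops over the groups directly instead of going through agg, so it does not depend on what agg accepts as a return value. The name medians is only illustrative and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same toy data as in the Setup section above.
arrays = [np.array([-0.030460168, -0.0014596573, 0.0997446, -0.18]),
          np.array([-0.03197706, 0.015620711, 0.05890667, -0.0402]),
          np.array([-0.060918115, 0.07939958, 0.0333591, 0.035655]),
          np.array([-0.012854534, 0.07733478, 0.12939823, 0.00992]),
          np.array([-0.04184026, 0.03382166, 0.1427004, -0.204424])]
users = ['bob', 'bob', 'bob', 'mary', 'mary']
df = pd.DataFrame(data={'user': users, 'features': arrays})

# Iterating over the grouped column yields (user, Series of arrays) pairs.
# Stack each user's vectors into a 2-D array (one row per vector) and take
# the element-wise median down the rows (axis=0).
medians = {user: np.median(np.vstack(list(vectors)), axis=0)
           for user, vectors in df.groupby('user')['features']}

for user, vector in medians.items():
    print(user, vector)
```

With an odd number of rows per user (bob has three) the median of each component is the middle value; with an even number (mary has two) np.median averages the two middle values, so for mary it coincides with the mean.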