I have the following code with groupby function
 order_diff = order.groupby("CD_QTY").filter(
     lambda x: x["BOX_ID"].nunique() > 1
)
But the filter function is very slow
I would like to use the transform or map function. But the result with both transform and map function is different than the filter function
I have tried transform() and map()
# tranform
oder_diff = order[order.groupby("CD_QTY")["BOX_ID"].transform('nunique').gt(1)]
#using map()
order[order["CD_QTY"].map(order.groupby("CD_QTY")["BOX_ID"].nunique()).gt(1)]
I am a bit confused about why the results are different than the filter(). I would appreciate your feedback on this. Thank you.
My rule of thumb is that any pandas function that take a custom lambda will be slowed down significantly due to looping and hence perfomance-limited by the GIL.
If you care about speed, use vectorized functions as much as possible:
# Some mock data
n = 10_000
np.random.seed(42)
cd_qty = np.random.randint(1, n, n)
box_id = np.random.randint(1, 5, n)
order = pd.DataFrame({
    'CD_QTY': cd_qty,
    'BOX_ID': box_id
})
# The original solution. 1.59s
order_diff_1 = order.groupby("CD_QTY").filter(
     lambda x: x["BOX_ID"].nunique() > 1
)
# Vectorized. 8.5 ms -> 187x faster
count = order.groupby(['CD_QTY'])['BOX_ID'].nunique()
cond = order['CD_QTY'].isin(count[count > 1].index)
order_diff_2 = order.loc[cond]
# Check if they produce the same outputs
all(order_diff_1 == order_diff_2) # --> True
Testing and your both solutions are faster like accepted answer, here is winner GroupBy.transform:
# Some mock data
n = 10_000
np.random.seed(42)
cd_qty = np.random.randint(1, n, n)
box_id = np.random.randint(1, 5, n)
order = pd.DataFrame({
    'CD_QTY': cd_qty,
    'BOX_ID': box_id
})
count = order.groupby(['CD_QTY'])['BOX_ID'].nunique()
cond = order['CD_QTY'].isin(count[count > 1].index)
order_diff_2 = order.loc[cond]
order_diff = order[order.groupby("CD_QTY")["BOX_ID"].transform('nunique').gt(1)]
order_diff_3 =order[order["CD_QTY"].map(order.groupby("CD_QTY")["BOX_ID"].nunique()).gt(1)]
print (all(order_diff == order_diff_2))
True
print (all(order_diff == order_diff_3))
True
In [201]: %%timeit
     ...: count = order.groupby(['CD_QTY'])['BOX_ID'].nunique()
     ...: cond = order['CD_QTY'].isin(count[count > 1].index)
     ...: order_diff_2 = order.loc[cond]
     ...: 
4.8 ms ± 56.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [202]: %%timeit
     ...: order[order.groupby("CD_QTY")["BOX_ID"].transform('nunique').gt(1)]
     ...: 
4.02 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [203]: %%timeit
     ...: order[order["CD_QTY"].map(order.groupby("CD_QTY")["BOX_ID"].nunique()).gt(1)]
     ...: 
4.31 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With