I have a data frame
            ID  CAT    SCORE
0            0    0  8325804
1            0    1  1484405
...        ...  ...      ...
1999980  99999    0  4614037
1999981  99999    1  1818470
Where I group the data by ID and want to know the 2 categories per ID with the highest score. I can see two solutions to that:
df2 = df.groupby('ID').apply(lambda g: g.nlargest(2, columns='SCORE'))
or manually converting it to a list of tuples, sorting the tuples, removing for each ID except for two and then converting back to a dataframe. The first one should be way faster than the second one, but I observe that the manual solution is WAY faster.
Why is manual nlargest faster than pandas solution?
import numpy as np
import pandas as pd
import time
def create_df(n=10**5, categories=20):
    np.random.seed(0)
    df = pd.DataFrame({'ID': [id_ for id_ in range(n) for c in range(categories)],
                       'CAT': [c for id_ in range(n) for c in range(categories)],
                       'SCORE': np.random.randint(10**7, size=n * categories)})
    return df
def are_dfs_equal(df1, df2):
    columns = sorted(df1.columns)
    if len(df1.columns) != len(df2.columns):
        return False
    elif not all(el1 == el2 for el1, el2 in zip(columns, sorted(df2.columns))):
        return False
    df1_list = [tuple(x) for x in df1[columns].values]
    df1_list = sorted(df1_list, reverse=True)
    df2_list = [tuple(x) for x in df2[columns].values]
    df2_list = sorted(df2_list, reverse=True)
    is_same = df1_list == df2_list
    return is_same
def manual_nlargest(df, n=2):
    df_list = [tuple(x) for x in df[['ID', 'SCORE', 'CAT']].values]
    df_list = sorted(df_list, reverse=True)
    l = []
    current_id = None
    current_id_count = 0
    for el in df_list:
        if el[0] != current_id:
            current_id = el[0]
            current_id_count = 1
        else:
            current_id_count += 1
        if current_id_count <= n:
            l.append(el)
    df = pd.DataFrame(l, columns=['ID', 'SCORE', 'CAT'])
    return df
df = create_df()
t0 = time.time()
df2 = df.groupby('ID').apply(lambda g: g.nlargest(2, columns='SCORE'))
t1 = time.time()
print('nlargest solution: {:0.2f}s'.format(t1 - t0))
t0 = time.time()
df3 = manual_nlargest(df, n=2)
t1 = time.time()
print('manual nlargest solution: {:0.2f}s'.format(t1 - t0))
print('is_same: {}'.format(are_dfs_equal(df2, df3)))
gives
nlargest solution: 97.76s
manual nlargest solution: 4.62s
is_same: True
I guess you can use this :
df.sort_values(by=['SCORE'],ascending=False).groupby('ID').head(2)
This is the same as your manual solution using Sort/head functions on pandas groupby.
t0 = time.time()
df4 = df.sort_values(by=['SCORE'],ascending=False).groupby('ID').head(2)
t1 = time.time()
df4_list = [tuple(x) for x in df4[['ID', 'SCORE', 'CAT']].values]
df4_list = sorted(df4_list, reverse=True)
is_same = df3_list == df4_list
print('SORT/HEAD solution: {:0.2f}s'.format(t1 - t0))
print(is_same)
gives
SORT/HEAD solution: 0.08s
True
timeit
77.9 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).
As to why nlargest is slower than the other solutions ?, I guess calling it for each group is creating an overhead (%prun is showing 15764409 function calls (15464352 primitive calls) in 30.293 seconds).
For this solution (1533 function calls (1513 primitive calls) in 0.078 seconds)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With