This is a follow up question to this one, where jezrael used pandas.DataFrame.groupby to increment by a factor of some hundreds the speed of a list creation. Specifically, let df be a large dataframe, then
index = list(set(df.index))
list_df = [df.loc(x) for x in index]
and
list_df = [x for i,x in df.groupby(level=0, sort=False)]
produce the same result, with the latter being more than 200 times faster than the former, even ignoring the list creation step. Why?
I would be very glad if someone could let me understand why there is such a massive performance difference. Thanks in advance!
Edit: as suggested by Alex Riley in his comment, I confirm that the tests have been run on a dataframe with non-unique and non-monotonic index.
GROUP BY is slightly faster than SELECT DISTINCT The slower the drive, the bigger the difference. This may not be a universal rule (we tested simple cases only) The difference is hardly noticeable to users.
transform with user-defined functions, Pandas is much faster with common functions like mean and sum because they are implemented in Cython. The speed differences are not small. The current version of Groupby can handle multi-dimensional outputs. It also checks if data is already sorted and runs faster if it is.
Pandas groupby is used for grouping the data according to the categories and apply a function to the categories. It also helps to aggregate data efficiently. Pandas dataframe. groupby() function is used to split the data into groups based on some criteria.
The groupby is one of the most frequently used Pandas functions in data analysis. It is used for grouping the data points (i.e. rows) based on the distinct values in the given column or columns. We can then calculate aggregated values for the generated groups.
Because your data frame is not sorted on the index, which means all the subsetting has to be done with slow vector scan and fast algorithm like binary search can not be applied; While groupby always sort the data frame by the group by variable first, you can mimic this behavior by writing a simple algorithm that sort the index and then subset to validate this:
def sort_subset(df):
    # sort index and find out the positions that separate groups
    df = df.sort_index()
    split_indices = np.flatnonzero(np.ediff1d(df.index, to_begin=1, to_end=1))
    list_df = []
    for i in range(len(split_indices)-1):
        start_index = split_indices[i]
        end_index = split_indices[i+1]
        list_df.append(df.iloc[start_index:end_index])
    return list_df
Some timing:
import pandas as pd
import numpy as np
nrow = 1000000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'], index=np.random.randint(100, size=nrow))
index = list(set(df.index))
print('no of groups: ', len(index))
%timeit list_df_1 = [df.loc[x] for x in index]
#no of groups:  100
#13.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
#54.8 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Not as fast because my algorithm is not optimized at all but the same order of magnitude
%timeit list_df_3 = sort_subset(df)
#102 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
list_df_1 = [df.loc[x] for x in index]
list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
list_df_3 = sort_subset(df)
Compare the result:
all(list_df_3[i].eq(list_df_2[i]).all().iat[0] for i in range(len(list_df_2)))
# True
You see a significant speed up if you sort the index before subsetting as well:
def sort_subset_with_loc(df):
    df = df.sort_index()
    list_df_1 = [df.loc[x] for x in index]
    return list_df_1
%timeit sort_subset_with_loc(df)
# 25.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With