Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to summarize on different groupby combinations?

I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different order.

df1 = pd.DataFrame( { 
    "County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] , 
    "Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
    "Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
    "Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
    "Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )

I can do a groupby on Crop1, Crop2 and Crop3 and get the sum of total_pop:

df1_grouped=df1.groupby(['Crop1',"Crop2","Crop3"])['Total_pop'].sum().reset_index()

That gives me the total for specific combinations of the crops:

df1_grouped
apples  melons  grain   1500
grain   melons  apples  2000
melons  grain   apples  4500
pears   carrots raddish 6700
raddish pears   carrots 2500

What I would like, though, is to get the total population on different combinations of crops -- irrespective of whether the listed crop was crop1, crop2, or crop3. The desired result would be this:

apples  melons   grain    8000
pears   carrots  raddish  9200 

Thank you for any guidance.

like image 757
JamesMiller Avatar asked Dec 29 '25 04:12

JamesMiller


1 Answers

Since your data seem to guarantee 3 unique crops per country ("I am compiling a table of top-3 crops by county."), it suffices to sort the values and assign back.

import numpy as np

cols = ['Crop1', 'Crop2', 'Crop3']
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)

       County    Crop1  Crop2    Crop3  Total_pop
0      Harney   apples  grain   melons       2000
1       Baker   apples  grain   melons       1500
2     Wheeler   apples  grain   melons       3000
3  Hood River   apples  grain   melons       1500
4       Wasco  carrots  pears  raddish       2000
5      Morrow  carrots  pears  raddish       2500
6       Union  carrots  pears  raddish       2700
7        Lake  carrots  pears  raddish       2000

Then to summarize:

df1.groupby(cols).sum()

#                       Total_pop
#Crop1   Crop2 Crop3             
#apples  grain melons        8000
#carrots pears raddish       9200

The benefit is that you avoid Series.apply or .apply(axis=1). For larger DataFrames, the performance difference is noticeable:

df1 = pd.concat([df1]*10000, ignore_index=True)

cols = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

to_sum = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 127
ALollz Avatar answered Dec 31 '25 17:12

ALollz