Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas Groupby isin

I have a dataframe that lists different teams (green, blue, yellow,orange, [there are hundreds of teams]etc) and also lists their revenue on a monthly basis. I want to be able to create a list of the top 10 teams based on revenue and then feed that into a groupby statement so that I am only looking at those teams as I work through various dataframes. These are the statements I have created and that I am having trouble with:

Rev = df['Revenue'].head(10) and I have also used Rev = df.nlargest(10,['Revenue'])

grpby = df.groupby([df['team'].isin(rev), 'team'], as_index=False)['Revenue'].sum().sort_values('Revenue', ascending=False).reset_index()

*Edit: Other code leading up to this request: *Edit: df = pd.read_excel('c:/Test.xlsx', sheet_name="Sheet1", index_col = 'Date', parse_dates=True)

*Edit: df = pd.DataFrame(df)

I can make the groupby statement work, but I cannot feed in the 'Rev' list to the groupby statement that limits/filters which groups to look at.

Also, when using a groupby statement to create a dataframe, how do I add back in other columns that are not being grouped? For example, in my above statement i try to utilize 'team' and 'revenue', but if I also wanted to add in other columns like ('location' or 'team lead') what is the syntax to add in more columns?

*Edit Sample input via excel file: Teams Revenue Green 10 Blue 15 Red 20 Orange 5 In the above example, I would like to use a statement that takes the top three and saves as a list and then feed that into the groupby statement. Now it looks like i have not filled the actual dataframe? *from the console: Empty DataFrame Columns: [Team, Revenue] Index: []

like image 231
John Avatar asked Oct 28 '25 19:10

John


1 Answers

Need filter as first step by boolean indexing:

Sample:

df = pd.DataFrame({'Teams': ['Green', 'Blue', 'Red', 'Orange', 'Green', 'Blue', 'Grey', 'Purple'], 
                   'Revenue': [18, 15, 20, 5, 10, 15, 2, 5], 
                   'Location': ['A', 'B', 'V', 'G', 'A', 'D', 'B', 'C']})

print (df)
    Teams  Revenue Location
0   Green       18        A
1    Blue       15        B
2     Red       20        V
3  Orange        5        G
4   Green       10        A
5    Blue       15        D
6    Grey        2        B
7  Purple        5        C

First get top values and select column Teams:

Rev = df.nlargest(3,'Revenue')['Teams']
print (Rev)
2      Red
0    Green
1     Blue
Name: Teams, dtype: object

Then need filter first by boolean indexing:

print (df[df['Teams'].isin(Rev)])
   Teams  Revenue Location
0  Green       18        A
1   Blue       15        B
2    Red       20        V
4  Green       10        A
5   Blue       15        D

df1 = (df[df['Teams'].isin(Rev)]
        .groupby('Teams',as_index=False)['Revenue']
        .sum()
        .sort_values('Revenue', ascending=False))
print (df1)
   Teams  Revenue
0   Blue       30
1  Green       28
2    Red       20

If need multiple columns to output is necessary set aggregation function for each of them like:

df2 = (df[df['Teams'].isin(Rev)]
        .groupby('Teams',as_index=False)
        .agg({'Revenue':'sum', 'Location': ', '.join, 'Another col':'mean'}))
print (df2)
   Teams  Revenue Location
0   Blue       30     B, D
1  Green       28     A, A
2    Red       20        V
like image 93
jezrael Avatar answered Oct 30 '25 10:10

jezrael