Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group by Labels in a Series using Pandas Groupby in Python

Probably a pretty beginner question, but just learning how to use Pandas in Python. I'd like to perform a groupby using Pandas that includes groups that aren't in the dataset. Say I have the following data:

State           Year         Number
Alabama         2010         2
Texas           2013         1
Mississippi     2011         6
Florida         2010         4
Alabama         2012         1
Texas           2010         8
Mississippi     2019         5
Florida         2017         3

I'd like to use groupby to sum the numbers by state then by year, but I'd like to include all 50 states (so I'd have a ton of zeros in this example).

I'm able to do it with just the states included in the dataframe no problem with:

grouped = df.groupby(['State', 'Year'])['Number'].sum()

But if I try to group by a series I have with all 50 state names like this:

states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
grouped = df.groupby([states, 'Year'])['Number'].sum()

I run into issues. Any simple way to do this?

like image 331
knebraska Avatar asked Jan 19 '26 19:01

knebraska


1 Answers

You are basically looking for reindex. However, reindex is not working with MultiIndex. You can unstack and stack:

grouped = df.groupby(['State','Year']).sum()
grouped.unstack('Year', fill_value=0).reindex(states, fill_value=0).stack()

Or chain them together:

(df.groupby(['State','Year']).sum()
   .unstack('Year', fill_value=0)
   .reindex(states, fill_value=0)
   .stack()
)

Output:

              Number
State   Year        
Alabama 2010       2
        2011       0
        2012       1
        2013       0
        2017       0
...              ...
Wyoming 2011       0
        2012       0
        2013       0
        2017       0
        2019       0
like image 93
Quang Hoang Avatar answered Jan 21 '26 08:01

Quang Hoang