Group by Labels in a Series using Pandas Groupby in Python

Question

Probably a pretty beginner question, but just learning how to use Pandas in Python. I'd like to perform a groupby using Pandas that includes groups that aren't in the dataset. Say I have the following data:

State           Year         Number
Alabama         2010         2
Texas           2013         1
Mississippi     2011         6
Florida         2010         4
Alabama         2012         1
Texas           2010         8
Mississippi     2019         5
Florida         2017         3

I'd like to use groupby to sum the numbers by state then by year, but I'd like to include all 50 states (so I'd have a ton of zeros in this example).

I'm able to do it with just the states included in the dataframe no problem with:

grouped = df.groupby(['State', 'Year'])['Number'].sum()

But if I try to group by a series I have with all 50 state names like this:

states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
grouped = df.groupby([states, 'Year'])['Number'].sum()

I run into issues. Any simple way to do this?

Quang Hoang · Accepted Answer

You are basically looking for reindex. However, reindex is not working with MultiIndex. You can unstack and stack:

grouped = df.groupby(['State','Year']).sum()
grouped.unstack('Year', fill_value=0).reindex(states, fill_value=0).stack()

Or chain them together:

(df.groupby(['State','Year']).sum()
   .unstack('Year', fill_value=0)
   .reindex(states, fill_value=0)
   .stack()
)

Output:

              Number
State   Year        
Alabama 2010       2
        2011       0
        2012       1
        2013       0
        2017       0
...              ...
Wyoming 2011       0
        2012       0
        2013       0
        2017       0
        2019       0

Group by Labels in a Series using Pandas Groupby in Python

Tags:

python

pandas

pandas-groupby

knebraska

1 Answers

Quang Hoang

Recent Activity

Donate For Us

Group by Labels in a Series using Pandas Groupby in Python

Tags:

python

pandas

pandas-groupby

knebraska

1 Answers

Quang Hoang

Related questions

Recent Activity

Donate For Us