Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Dataframe Reshape/Pivot - Duplicate Values in Index Error

Tags:

pandas

pivot

I'm using Python 2.7 with Pandas 0.19.2

I've looked at some similar questions/answers to this problem but still haven't been able to solve.

I have the following dataframe:

          Date         name      objects
0   2005-11-17         Pete        6
1   2014-02-04         Rob         3
2   2012-02-13         Rob         2
3   2004-12-16         Julia       4
4   2012-02-13         Mike        9

I want to reshape it as follows:

              Pete   Rob    Julia   Mike

2005-11-17     6     Nan     Nan    Nan
2014-02-04    Nan     3      Nan    Nan
2012-02-13    Nan     2      Nan    9
2004-12-16    Nan    Nan      4     Nan

I have been trying to run df.pivot(index='Date', columns='name', value='objects') but I get an 'index has duplicate value' error.

I thought that pivot or pivot_table work in these situations since there is essentially no 'conflict' between indexes and columns (e.g. multiple values for 'Rob' and '2014-02-04' in the dataframe).

I am not looking to aggregate - just reshape. Also in the example the dates are not in order - but that doesn't matter.

I feel this should be an easy fix but I can't see it. Can someone share some insight?

Thanks in advance.

like image 936
Windstorm1981 Avatar asked Dec 20 '25 22:12

Windstorm1981


2 Answers

It seems in your real data are duplicates, see sample:

print (df)
         Date   name  objects
0  2005-11-17   Pete        6
1  2014-02-04    Rob        3
2  2012-02-13    Rob        2
3  2004-12-16  Julia        4
4  2012-02-13   Mike        9 <-duplicates for 2012-02-13 and Mike
5  2012-02-13   Mike       18 <-duplicates for 2012-02-13 and Mike

Solution are pivot_table with some aggregate function, default is np.mean but can be changed to sum, 'meadian', first, last.

df = df.pivot_table(index='Date', columns='name', values='objects', aggfunc=np.mean)
print (df)
name        Julia  Mike  Pete  Rob
Date                              
2004-12-16    4.0   NaN   NaN  NaN
2005-11-17    NaN   NaN   6.0  NaN
2012-02-13    NaN  13.5   NaN  2.0 <-13.5 is mean
2014-02-04    NaN   NaN   NaN  3.0

Another solution with groupby, aggregate function and unstack:

df = df.groupby(['Date','name'])['objects'].mean().unstack()
print (df)
name        Julia  Mike  Pete  Rob
Date                              
2004-12-16    4.0   NaN   NaN  NaN
2005-11-17    NaN   NaN   6.0  NaN
2012-02-13    NaN  13.5   NaN  2.0
2014-02-04    NaN   NaN   NaN  3.0

For checking duplicated is possible use duplicated with boolean indexing:

df = df[df.duplicated(subset=['Date','name'], keep=False)]
print (df)
         Date  name  objects
4  2012-02-13  Mike        9
5  2012-02-13  Mike       18
like image 143
jezrael Avatar answered Dec 22 '25 12:12

jezrael


You can use df.pivot():

In [205]: x.pivot(index='Date', columns='name', values='objects')
Out[205]:
name        Julia  Mike  Pete  Rob
Date
2004-12-16    4.0   NaN   NaN  NaN
2005-11-17    NaN   NaN   6.0  NaN
2012-02-13    NaN   9.0   NaN  2.0
2014-02-04    NaN   NaN   NaN  3.0

alternative solution:

In [207]: x.groupby(['Date','name'])['objects'].first().unstack('name')
Out[207]:
name        Julia  Mike  Pete  Rob
Date
2004-12-16    4.0   NaN   NaN  NaN
2005-11-17    NaN   NaN   6.0  NaN
2012-02-13    NaN   9.0   NaN  2.0
2014-02-04    NaN   NaN   NaN  3.0

Actually pivot_table() - does something very similar to mentioned solution

PS tested using Pandas version 0.19.2

like image 28
MaxU - stop WAR against UA Avatar answered Dec 22 '25 10:12

MaxU - stop WAR against UA



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!