How do I pivot the pandas dataframe below such that the col values become columns, row values become the index, and mean of val0 becomes the values? (In some cases this is called transforming from long-format to wide-format.)
Consider a dataframe df with columns 'key', 'row', 'item', 'col', and random float values 'val0', 'val1'. I conspicuously named the columns and relevant column values to correspond with how I want to pivot them. (Setup code at bottom.)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
How to avoid getting ValueError: Index contains duplicate entries, cannot reshape?
How do I pivot df such that the col values become columns, row values become the index, and mean of val0 are the values?
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
How do I pivot...
... so that missing values are 0?
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
... to do an aggregate function other than mean, like sum?
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
... to do more than one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
... to aggregate over multiple 'value' columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
... to subdivide by multiple columns? (item0,item1,item2..., col0,col1,col2...)
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
... to subdivide by multiple rows: (key0,key1... row0,row1,row2...)
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
... to aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
... to convert a DataFrame from long-to-wide by pivoting on ONLY two columns? Given:
np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
df2
A B
0 a 0
1 a 11
2 a 2
3 a 11
4 b 10
5 b 10
6 b 14
7 c 7
The expected output should look something like
a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
How do I flatten the multi-index to single index after pivot?
From:
1 2
1 1 2
a 2 1 1
b 2 1 0
c 1 0 0
To:
1|1 2|1 2|2
a 2 1 1
b 2 1 0
c 1 0 0
import numpy as np
import pandas as pd
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
# np.char.add is the public alias of numpy.core.defchararray.add
np.char.add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
Why is this question not a duplicate, and why is it more useful than the following autosuggestions?
How to pivot a dataframe in Pandas? only covers the specific case of 'Country' to row-index, values of 'Indicator' for 'Year' to multiple columns and no aggregation of values.
When pivoting a Pandas dataframe, how do I make the column names the same as in R? (flat, label in column names) asks how to pivot in pandas like in R, i.e. autogenerate an individual column for each value of strength...
pandas pivoting a dataframe, duplicate rows asks about the syntax for pivoting multiple columns, without needing to list them all.
None of the existing questions and answers are comprehensive, so this is an attempt at a canonical question and answer that encompasses all aspects of pivoting.
Here is a list of idioms we can use to pivot:
pd.DataFrame.pivot_table
Essentially groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach intended by the developers.
pd.DataFrame.groupby + pd.DataFrame.unstack
Group by every column that will become a row or column level, aggregate, then unstack the levels that you want to be in the column index.
pd.DataFrame.set_index + pd.DataFrame.unstack
Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values. Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate, and if either rows or columns are not unique, this method will fail.
pd.crosstab
A specialized version of pivot_table and, in its purest form, the most intuitive way to perform several tasks.
pd.factorize + np.bincount
pd.get_dummies + pd.DataFrame.dot
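As a rough illustration of how the first few idioms relate, the sketch below runs pivot_table, groupby + unstack, and crosstab on a small made-up frame. The frame `toy` and its column names are assumptions for this example only and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; (row='a', col='x') appears twice on purpose.
toy = pd.DataFrame({
    'row': ['a', 'a', 'a', 'b', 'b'],
    'col': ['x', 'x', 'y', 'x', 'y'],
    'val': [1.0, 3.0, 5.0, 7.0, 9.0],
})

# 1. pivot_table: name the rows, columns, values, and aggregation.
via_pivot_table = toy.pivot_table(
    index='row', columns='col', values='val', aggfunc='mean')

# 2. groupby + unstack: aggregate first, then move 'col' into the columns.
via_groupby = toy.groupby(['row', 'col'])['val'].mean().unstack()

# 3. crosstab: pass the Series themselves rather than column names.
via_crosstab = pd.crosstab(
    index=toy['row'], columns=toy['col'],
    values=toy['val'], aggfunc='mean')

print(via_pivot_table)
```

All three calls should produce the same 2 x 2 table of means here; which one reads best is largely a matter of taste.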
Why do I get ValueError: Index contains duplicate entries, cannot reshape?
This occurs because pandas is attempting to reindex either a columns or an index object with duplicate entries. Several methods can perform a pivot, but some of them are not well suited to duplicated keys. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:
df.duplicated(['row', 'col']).any()
True
So when I pivot using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
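A minimal sketch of the failure, and of one way around it, on a made-up three-row frame (the names `toy`, `dupes`, and `wide` are assumptions for illustration). It locates the duplicated (row, col) pairs, shows that pivot raises, and then aggregates the duplicates with pivot_table instead.

```python
import pandas as pd

# Hypothetical data: the pair ('r0', 'c0') occurs twice.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(['row', 'col'], keep=False)]

# pivot cannot decide which of the two values belongs in cell (r0, c0).
try:
    toy.pivot(index='row', columns='col', values='val')
    raised = False
    message = ''
except ValueError as exc:
    raised = True
    message = str(exc)

# pivot_table resolves the clash by aggregating (here: the mean).
wide = toy.pivot_table(index='row', columns='col', values='val', aggfunc='mean')
print(message)
print(wide)
```

Inspecting `dupes` first is a cheap way to decide whether the duplicates are a data problem to fix or something that should be aggregated.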
What I'm going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.
How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.
How do I make it so that missing values are 0?
pd.DataFrame.pivot_table
fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0.
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
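To see what fill_value changes, here is a small sketch on a hypothetical frame in which the combination (row='b', col='y') never occurs. The frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: there is no observation for row 'b' and col 'y'.
toy = pd.DataFrame({
    'row': ['a', 'a', 'b'],
    'col': ['x', 'y', 'x'],
    'val': [1.0, 2.0, 3.0],
})

# Without fill_value the missing combination shows up as NaN.
with_nan = toy.pivot_table(
    index='row', columns='col', values='val', aggfunc='mean')

# With fill_value=0 the same cell is filled with 0.
with_zero = toy.pivot_table(
    index='row', columns='col', values='val', aggfunc='mean', fill_value=0)

print(with_nan)
print(with_zero)
```

Whether 0 is an appropriate fill depends on the aggregation: it is natural for counts and sums, but for a mean it can be misleading, since "no data" and "a mean of zero" become indistinguishable.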
Can I get something other than mean, like maybe sum?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Can I do more than one aggregation at a time?
Notice that for pivot_table and crosstab I passed a list of callables. groupby.agg, on the other hand, accepts strings for a limited number of built-in functions. groupby.agg would also have accepted the same callables we passed to the others, but the string names are often more efficient because pandas can dispatch them to its optimized implementations.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
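The sketch below shows the string-name form with groupby.agg on a made-up frame (all names here are assumptions for illustration). It also shows how the result is laid out: the aggregation name becomes the outer column level and the pivoted key the inner one.

```python
import pandas as pd

# Hypothetical data with one duplicated (row, col) pair.
toy = pd.DataFrame({
    'row': ['a', 'a', 'a', 'b'],
    'col': ['x', 'x', 'y', 'x'],
    'val': [1.0, 3.0, 5.0, 7.0],
})

# One column per aggregation, indexed by (row, col).
agged = toy.groupby(['row', 'col'])['val'].agg(['sum', 'mean'])

# Move 'col' into the columns; missing combinations become 0.
wide = agged.unstack(fill_value=0)

# Each aggregation can be pulled out as an ordinary wide table.
sums = wide['sum']
means = wide['mean']
print(wide)
```

Selecting `wide['sum']` or `wide['mean']` is usually the simplest way to get back to a single-level table when only one of the aggregations is needed downstream.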
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
We pass values=['val0', 'val1'], but we could've left that off completely.
df.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
pd.DataFrame.groupby
df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
Can I subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
Can I subdivide by multiple rows?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
pd.DataFrame.set_index
This works because the set of keys is unique for both rows and columns.
df.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.crosstab
pd.crosstab(df['row'], df['col'])
pd.factorize + np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1
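The `i * m + j` step can be traced by hand on a tiny example. The sketch below uses four made-up (row, col) pairs; the names `rows`, `cols`, `flat`, and `table` are assumptions for illustration. Each pair is mapped to a position in a flat array of length `n * m`, the positions are counted, and the counts are reshaped into the table.

```python
import numpy as np
import pandas as pd

# Hypothetical pairs: (r0, c1), (r1, c0), (r0, c1), (r0, c0)
rows = pd.Series(['r0', 'r1', 'r0', 'r0'])
cols = pd.Series(['c1', 'c0', 'c1', 'c0'])

# factorize numbers the labels in order of first appearance.
i, r = pd.factorize(rows)   # i = [0, 1, 0, 0], r = ['r0', 'r1']
j, c = pd.factorize(cols)   # j = [0, 1, 0, 1], c = ['c1', 'c0']
n, m = r.size, c.size       # 2 row labels, 2 column labels

# Row-major position of cell (i, j) in a flattened n x m grid.
flat = i * m + j            # [0, 3, 0, 1]

# Count how often each position occurs, then restore the 2-D shape.
counts = np.bincount(flat, minlength=n * m).reshape(n, m)
table = pd.DataFrame(counts, index=r, columns=c)
print(table)
```

Because factorize keeps first-appearance order, the labels come out unsorted, which is why the output above lists col3 and row3 first. Calling `sort_index()` and `sort_index(axis=1)` on the result gives the same layout as pd.crosstab.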
pd.get_dummies
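# dtype=int keeps the indicators numeric; recent pandas versions default to bool,
# and a dot product of booleans would not produce counts.
pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))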
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
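Why the dot product counts co-occurrences: each get_dummies frame holds one 0/1 indicator column per label, so multiplying the transposed row indicators by the column indicators sums, for every (row label, col label) pair, the observations where both indicators are 1. A minimal sketch on made-up data follows; the names are assumptions, and dtype=int is passed explicitly so the indicators are integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical pairs: (r0, c1), (r1, c0), (r0, c1), (r0, c0)
toy = pd.DataFrame({
    'row': ['r0', 'r1', 'r0', 'r0'],
    'col': ['c1', 'c0', 'c1', 'c0'],
})

# One 0/1 column per label; one line per observation.
row_ind = pd.get_dummies(toy['row'], dtype=int)   # shape (4, 2)
col_ind = pd.get_dummies(toy['col'], dtype=int)   # shape (4, 2)

# (2 x 4) dot (4 x 2): entry [r, c] = number of observations with both labels.
counts = row_ind.T.dot(col_ind)
print(counts)
```

This builds a dense indicator matrix per key, so it is best suited to keys with a modest number of distinct labels.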
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?
DataFrame.pivot
The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:
df2.insert(0, 'count', df2.groupby('A').cumcount())
df2
count A B
0 0 a 0
1 1 a 11
2 2 a 2
3 3 a 11
4 0 b 10
5 1 b 10
6 2 b 14
7 0 c 7
The second step is to use the newly created column as the index to call DataFrame.pivot.
df2.pivot(index='count', columns='A', values='B')
A a b c
count
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
DataFrame.pivot_table
Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.
df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
A a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
How do I flatten the multiple index to single index after pivot?
If the column labels are all strings (object dtype), use str.join:
df.columns = df.columns.map('|'.join)
Otherwise, use str.format:
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
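A self-contained sketch of both variants, on made-up data (the frames `toy`, `flat`, and `mixed` are assumptions for illustration). The first flattens the two-level columns produced by a pivot_table with string labels; the second handles a level that holds integers, where str.join would fail.

```python
import pandas as pd

# Hypothetical long-format data with two column keys.
toy = pd.DataFrame({
    'row': ['a', 'a', 'b'],
    'item': ['i0', 'i1', 'i0'],
    'col': ['x', 'y', 'x'],
    'val': [1.0, 2.0, 3.0],
})

wide = toy.pivot_table(
    index='row', columns=['item', 'col'], values='val',
    aggfunc='sum', fill_value=0)

# All labels are strings: each column key is a tuple such as ('i0', 'x').
flat = wide.copy()
flat.columns = flat.columns.map('|'.join)

# Mixed label types: '|'.join would raise on the integers, so format instead.
mixed = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')]))
mixed.columns = mixed.columns.map('{0[0]}|{0[1]}'.format)

print(list(flat.columns))
print(list(mixed.columns))
```

The format string above is written for exactly two levels; a frame with more levels would need a correspondingly longer template.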
To extend @piRSquared's answer, here is another version of Question 10.
DataFrame:
d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 3 a
6 5 c
Output:
0 1 2
A
1 a b c
2 a b None
3 a None None
5 c None None
Using df.groupby and pd.Series.tolist
t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
0 1 2
A
1 a b c
2 a b None
3 a None None
5 c None None
Or, a much better alternative using pd.pivot_table with df.squeeze:
t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)