I have a dataframe with a person's name as the index (can have multiple entries) and two columns 'X' and 'Y'. The columns 'X' and 'Y' can be any letter between A-C.
For example:
df = pd.DataFrame({'X' : ['A', 'B', 'A', 'C'], 'Y' : ['B', 'A', 'A', 'C']},index = ['Bob','Bob','John','Mike'])
For each person (i.e. index) I would like to get the number of occurrences of every unique combination of columns 'X' and 'Y' (for example - for Bob I have 1 count of ('A','B') and 1 count of ('B','A')).
When I do the following:
df.loc['Bob'].groupby(['X','Y']).size()
I get the correct results for Bob. How can I do this for each person without a loop? Ideally, I would get a dataframe with the different people as the index, every unique combination of columns 'X' and 'Y' as the columns, and the number of times each combination appears in the dataframe as the value:
      ('A','A')  ('A','B')  ('A','C')  ('B','A')  ...  ('C','C')
Bob           0          1          0          1  ...          0
John          1          0          0          0  ...          0
Mike          0          0          0          0  ...          1
Use get_dummies on the (X, Y) tuples, then group by the index and sum:
pd.get_dummies(df.apply(tuple, 1)).groupby(level=0).sum()
      (A, A)  (A, B)  (B, A)  (C, C)
Bob        0       1       1       0
John       1       0       0       0
Mike       0       0       0       1
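For comparison, here is a minimal sketch of the same per-person count done with pd.crosstab instead of get_dummies. This is a different technique from the answer above, not a restatement of it. The names `name` and `pair_counts` are only illustrative, and the inputs are passed as plain arrays so the repeated index labels ('Bob' appears twice) do not get in the way.

```python
import pandas as pd

df = pd.DataFrame(
    {"X": ["A", "B", "A", "C"], "Y": ["B", "A", "A", "C"]},
    index=["Bob", "Bob", "John", "Mike"],
)

# Cross-tabulate the person (rows) against the (X, Y) pair (columns).
# Plain NumPy arrays are used so crosstab does not try to align on the
# duplicated index labels.
pair_counts = pd.crosstab(
    index=df.index.to_numpy(),
    columns=[df["X"].to_numpy(), df["Y"].to_numpy()],
    rownames=["name"],       # illustrative label for the row axis
    colnames=["X", "Y"],
)

print(pair_counts)
```

The columns come back as a two-level (X, Y) MultiIndex rather than as tuples, and only the combinations that actually occur in the data are present.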
I think you can use:
# convert columns X and Y to tuples
df['tup'] = list(zip(df.X, df.Y))
# count each (index, tup) pair and reshape so the tuples become columns
df1 = df.reset_index().groupby(['index','tup']).size().unstack(fill_value=0)
print(df1)
tup    (A, A)  (A, B)  (B, A)  (C, C)
index
Bob         0       1       1       0
John        1       0       0       0
Mike        0       0       0       1
# get all combinations of the unique X and Y values
from itertools import product
comb = list(product(df.X.unique(), df.Y.unique()))
print(comb)
[('A', 'B'), ('A', 'A'), ('A', 'C'), ('B', 'B'), ('B', 'A'),
 ('B', 'C'), ('C', 'B'), ('C', 'A'), ('C', 'C')]
# reindex the columns by these combinations, filling missing ones with 0
print(df1.reindex(columns=comb, fill_value=0))
tup    (A, B)  (A, A)  (A, C)  (B, B)  (B, A)  (B, C)  (C, B)  (C, A)  (C, C)
index
Bob         1       0       0       0       1       0       0       0       0
John        0       1       0       0       0       0       0       0       0
Mike        0       0       0       0       0       0       0       0       1
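The "count, reshape, then fill in the missing combinations" idea can also be sketched with a two-level column index in place of a tuple column. This is a variation under stated assumptions, not the answer's exact code: the list `letters` hard-codes A to C as described in the question, and the names `name`, `counts`, `all_pairs` and `full` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"X": ["A", "B", "A", "C"], "Y": ["B", "A", "A", "C"]},
    index=["Bob", "Bob", "John", "Mike"],
)

# Count each (person, X, Y) triple, then move X and Y into the columns.
counts = (
    df.rename_axis("name")          # give the index a name to group on
      .reset_index()
      .groupby(["name", "X", "Y"])
      .size()
      .unstack(["X", "Y"], fill_value=0)
)

# Assumption: the possible letters are exactly A, B and C.
letters = ["A", "B", "C"]
all_pairs = pd.MultiIndex.from_product([letters, letters], names=["X", "Y"])

# Add the combinations that never occur, as all-zero columns.
full = counts.reindex(columns=all_pairs, fill_value=0)
print(full)
```

Building the column grid from a fixed list of letters, rather than from the values seen in the data, means a letter that never appears in 'X' or 'Y' still gets its columns.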