I'm trying to combine two different dataframes I've imported in python with pandas. They are the results of some eye-tracking I've done. One of them however contains both the class and method the user has gazed upon, meaning that for every row dataframe1 has dataframe2 has an extra one. Now this doesn't happen in every row so I can't just duplicate the rows, but what I was thinking was to add another row every time the index of dataframe2 had two of the same indices. Kinda like this:
dataframe1 = pd.DataFrame({'index':[1,2,3],'a':['asd','fgh','qwe'],'b':['dsa','hgf','ewq'],'c':['sad','gfh','wqe']})
dataframe1=dataframe1[['index','a','b','c']]
dataframe1
index a b c
0 1 asd dsa sad
1 2 fgh hgf gfh
2 3 qwe ewq wqe
dataframe2 = pd.DataFrame({'index':[1,1,2,3,3],'d':['zxc','cxz','xzc','zxc','xcz']})
dataframe2=dataframe2[['index','d']]
dataframe2
index d
0 1 zxc
1 1 cxz
2 2 xzc
3 3 zxc
4 3 xcz
Expected Result:
index, a, b, c, d
1, asd, dsa, sad, zxc
1, nan, nan, nan, cxz
2, fgh, hgf, gfh, xzc
3, qwe, ewq, wqe, zxc
3, nan, nan, nan, xcz
Any built in functions to use? The values can also just be the values of the previous line with the same index.
Use pd.merge
with an additional cumcounted column:
u = df2.assign(cnt=df2.groupby('index').cumcount())
v = df.assign(cnt=df.groupby('index').cumcount())
u.merge(v, on=['index', 'cnt'], how='left').drop('cnt', 1)
index d a b c
0 1 zxc asd dsa sad
1 1 cxz NaN NaN NaN
2 2 xzc fgh hgf gfh
3 3 zxc qwe ewq wqe
4 3 xcz NaN NaN NaN
Details
We introduce cumulative counts for the duplicate values in "index".
u = df2.assign(cnt=df2.groupby('index').cumcount())
u
index d cnt
0 1 zxc 0
1 1 cxz 1
2 2 xzc 0
3 3 zxc 0
4 3 xcz 1
v = df.assign(cnt=df.groupby('index').cumcount())
v
index a b c cnt
0 1 asd dsa sad 0
1 2 fgh hgf gfh 0
2 3 qwe ewq wqe 0
We then force a LEFT JOIN wrt u
on "index" and "cnt". This way, NaNs are introduced int the result:
u.merge(v, on=['index', 'cnt'], how='left')
index d cnt a b c
0 1 zxc 0 asd dsa sad
1 1 cxz 1 NaN NaN NaN
2 2 xzc 0 fgh hgf gfh
3 3 zxc 0 qwe ewq wqe
4 3 xcz 1 NaN NaN NaN
The last step is to delete the temporary "cnt" column.
Using merge
with mask
and duplicated
:
df = df2.merge(df1)
cols = ['index','a','b','c']
df[['a','b','c']] = df[cols].mask(df[cols].duplicated())[['a','b','c']]
print(df)
index d a b c
0 1 zxc asd dsa sad
1 1 cxz NaN NaN NaN
2 2 xzc fgh hgf gfh
3 3 zxc qwe ewq wqe
4 3 xcz NaN NaN NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With