Pandas left join on duplicate keys but without increasing the number of columns

Question

I'm trying to combine two dataframes that I've imported into Python with pandas. They are the results of some eye tracking I've done. One of them, however, contains both the class and the method the user gazed at, so for some rows of dataframe1, dataframe2 has an extra row. This doesn't happen for every row, so I can't simply duplicate the rows. What I was thinking was to add another row every time dataframe2 has two rows with the same index. Kind of like this:

import pandas as pd

dataframe1 = pd.DataFrame({'index':[1,2,3],'a':['asd','fgh','qwe'],'b':['dsa','hgf','ewq'],'c':['sad','gfh','wqe']})
dataframe1 = dataframe1[['index','a','b','c']]
dataframe1
   index    a    b    c
0      1  asd  dsa  sad
1      2  fgh  hgf  gfh
2      3  qwe  ewq  wqe

dataframe2 = pd.DataFrame({'index':[1,1,2,3,3],'d':['zxc','cxz','xzc','zxc','xcz']})
dataframe2 = dataframe2[['index','d']]
dataframe2
   index    d
0      1  zxc
1      1  cxz
2      2  xzc
3      3  zxc
4      3  xcz

Expected Result:

index, a, b, c, d
1, asd, dsa, sad, zxc
1, nan, nan, nan, cxz
2, fgh, hgf, gfh, xzc
3, qwe, ewq, wqe, zxc
3, nan, nan, nan, xcz

Are there any built-in functions for this? The values in the extra rows could also just repeat the values of the previous row with the same index.

cs95 · Accepted Answer

Use merge with an additional cumcount column (here df and df2 stand for the question's dataframe1 and dataframe2):

u = df2.assign(cnt=df2.groupby('index').cumcount())
v = df.assign(cnt=df.groupby('index').cumcount())

u.merge(v, on=['index', 'cnt'], how='left').drop(columns='cnt')

   index    d    a    b    c
0      1  zxc  asd  dsa  sad
1      1  cxz  NaN  NaN  NaN
2      2  xzc  fgh  hgf  gfh
3      3  zxc  qwe  ewq  wqe
4      3  xcz  NaN  NaN  NaN

Details

We introduce cumulative counts for the duplicate values in "index".

u = df2.assign(cnt=df2.groupby('index').cumcount())
u
   index    d  cnt
0      1  zxc    0
1      1  cxz    1
2      2  xzc    0
3      3  zxc    0
4      3  xcz    1

v = df.assign(cnt=df.groupby('index').cumcount())
v
   index    a    b    c  cnt
0      1  asd  dsa  sad    0
1      2  fgh  hgf  gfh    0
2      3  qwe  ewq  wqe    0
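
As a minimal sketch of what cumcount does on its own, the snippet below numbers each occurrence of a key within its group, using the same key values as dataframe2. The names keys and occurrence are chosen here for illustration only.

```python
import pandas as pd

# Same key column as dataframe2 in the question.
keys = pd.DataFrame({'index': [1, 1, 2, 3, 3]})

# cumcount numbers the rows inside each group 0, 1, 2, ...
# in order of appearance, so the second row with key 1 gets 1.
occurrence = keys.groupby('index').cumcount()
print(occurrence.tolist())  # [0, 1, 0, 0, 1]
```

Because dataframe1 has each key only once, its counter is always 0, so only the first occurrence in dataframe2 finds a partner in the join.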

We then perform a LEFT JOIN from u onto v, using both "index" and "cnt" as keys. Rows of u whose ("index", "cnt") pair has no match in v get NaNs in the columns that come from v:

u.merge(v, on=['index', 'cnt'], how='left')

   index    d  cnt    a    b    c
0      1  zxc    0  asd  dsa  sad
1      1  cxz    1  NaN  NaN  NaN
2      2  xzc    0  fgh  hgf  gfh
3      3  zxc    0  qwe  ewq  wqe
4      3  xcz    1  NaN  NaN  NaN

The last step is to delete the temporary "cnt" column.
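
Putting the steps together, the following is a self-contained sketch using the question's data. The helper name merge_by_occurrence is hypothetical, and the final column reordering is only there to match the layout asked for in the question. It assumes that neither frame already has a column called cnt.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'index': [1, 2, 3],
                           'a': ['asd', 'fgh', 'qwe'],
                           'b': ['dsa', 'hgf', 'ewq'],
                           'c': ['sad', 'gfh', 'wqe']})
dataframe2 = pd.DataFrame({'index': [1, 1, 2, 3, 3],
                           'd': ['zxc', 'cxz', 'xzc', 'zxc', 'xcz']})


def merge_by_occurrence(left, right, key):
    """Left-join right onto left, pairing the n-th occurrence of each key."""
    # Temporary helper column; assumed not to clash with an existing column.
    u = left.assign(cnt=left.groupby(key).cumcount())
    v = right.assign(cnt=right.groupby(key).cumcount())
    # Join on the key plus the occurrence counter, then drop the counter.
    return u.merge(v, on=[key, 'cnt'], how='left').drop(columns='cnt')


result = merge_by_occurrence(dataframe2, dataframe1, 'index')
# Reorder the columns to index, a, b, c, d as in the expected result.
result = result[['index', 'a', 'b', 'c', 'd']]
print(result)
```

The keyword form drop(columns='cnt') is used because recent pandas versions no longer accept the axis as a second positional argument to drop.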

Space Impact · Answer

Using merge with mask and duplicated:

df = df2.merge(df1)
cols = ['index','a','b','c']
df[['a','b','c']] = df[cols].mask(df[cols].duplicated())[['a','b','c']]

print(df)
   index    d    a    b    c
0      1  zxc  asd  dsa  sad
1      1  cxz  NaN  NaN  NaN
2      2  xzc  fgh  hgf  gfh
3      3  zxc  qwe  ewq  wqe
4      3  xcz  NaN  NaN  NaN
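
For reference, here is a self-contained sketch of the same idea with the question's data. The names repeated, merged, value_cols and dup are chosen for illustration. It assumes that no two rows of dataframe1 share the same values in index, a, b and c, since genuinely repeated rows would be blanked as well.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'index': [1, 2, 3],
                           'a': ['asd', 'fgh', 'qwe'],
                           'b': ['dsa', 'hgf', 'ewq'],
                           'c': ['sad', 'gfh', 'wqe']})
dataframe2 = pd.DataFrame({'index': [1, 1, 2, 3, 3],
                           'd': ['zxc', 'cxz', 'xzc', 'zxc', 'xcz']})

# A plain merge on the shared "index" column repeats a, b, c for every
# matching row of dataframe2.
repeated = dataframe2.merge(dataframe1, on='index')

# Blank out a, b, c on every row that repeats an earlier (index, a, b, c).
value_cols = ['a', 'b', 'c']
dup = repeated[['index'] + value_cols].duplicated()
merged = repeated.copy()
merged[value_cols] = merged[value_cols].mask(dup)
print(merged)
```

The intermediate frame repeated is already the variant the question mentions as acceptable, where the extra rows simply carry the values of the previous row with the same index.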

Tags: python, pandas, dataframe

Martin Mansour
