Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Any way to replace pandas pd.merge?

Tags:

python

pandas

I have two dataframe

>>df1.info()
>><class 'pandas.core.frame.DataFrame'>
  Int64Index: 2598374 entries, 3975 to 3054366
  Data columns (total 14 columns): ......
>>df2.info()
>><class 'pandas.core.frame.DataFrame'>
  Int64Index: 2520405 entries, 2066 to 2519507
  Data columns (total 5 columns): ......

I wanna inner join them. I tried pd.merge and I got memory error. Thus, I tried to do same things without pd.merge.

Example dataframe for original method (failed: memory error)

df1 = pd.DataFrame({'A': ['1', '2', '3', '4','5'],
              'B': ['1', '1', '1', '1','1'],
              'C': ['c', 'A1', 'a', 'c3','a'],
              'D': ['B1', 'B1', 'B2', 'B3','B4'],
              'E': ['3', '3', '3', '3','3'],
              'F': ['3', '4', '5', '6','7'],
              'G': ['2', '2', '2', '2','2']})

df2 = pd.DataFrame({'A': ['1', '2',  '8','4'],
              'B': ['1', '2',  '5','1'],
              'x': ['3', '3', '2','2'],
              'y': ['3', '4', '6','7'],
              'z': ['2', '2', '2','2']})
>>   df1
         A  B   C   D  E  F  G
      0  1  1   c  B1  3  3  2
      1  2  1  A1  B1  3  4  2
      2  3  1   a  B2  3  5  2
      3  4  1  c3  B3  3  6  2
      4  5  1   a  B4  3  7  2

     df2
         A  B  x  y  z
      0  1  1  3  3  2
      1  2  2  3  4  2
      2  8  5  2  6  2
      3  4  1  2  7  2

df1 = pd.merge(df1,df2,how='inner',on=['A','B']) 

>>   df1    
         A  B   C   D  E  F  G  x  y  z
      0  1  1   c  B1  3  3  2  3  3  2
      1  4  1  c3  B3  3  6  2  2  7  2

Example for new method
(1) I tried to delete rows in df1 which are not in df2 by column['A']['B'].
(2) concat x,y,z columns to df1

df1 = pd.DataFrame({'A': ['1', '2', '3', '4','5'],
              'B': ['1', '1', '1', '1','1'],
              'C': ['c', 'A1', 'a', 'c3','a'],
              'D': ['B1', 'B1', 'B2', 'B3','B4'],
              'E': ['3', '3', '3', '3','3'],
              'F': ['3', '4', '5', '6','7'],
              'G': ['2', '2', '2', '2','2']})

df2 = pd.DataFrame({'A': ['1', '2',  '8','4'],
              'B': ['1', '2',  '5','1'],
              'x': ['3', '3', '2','2'],
              'y': ['3', '4', '6','7'],
              'z': ['2', '2', '2','2']})
>>   df1
         A  B   C   D  E  F  G
      0  1  1   c  B1  3  3  2
      1  2  1  A1  B1  3  4  2
      2  3  1   a  B2  3  5  2
      3  4  1  c3  B3  3  6  2
      4  5  1   a  B4  3  7  2

     df2
         A  B  x  y  z
      0  1  1  3  3  2
      1  2  2  3  4  2
      2  8  5  2  6  2
      3  4  1  2  7  2

df1 = df1.loc[((df1['A'].isin(df2.A)) & (df1['B'].isin(df2.B)) ) ]

>>   df1    
         A  B   C   D  E  F  G  
      0  1  1   c  B1  3  3  2  
      1  2  1  A1  B1  3  4  2
      3  4  1  c3  B3  3  6  2  

however, I got a logical error and I have no idea to solve this problem. Can anyone help?

like image 942
user2775128 Avatar asked Sep 05 '25 16:09

user2775128


2 Answers

You can try concat with set_index:

df1 = pd.concat([df1.set_index(['A','B']),
                 df2.set_index(['A','B'])], axis=1, join='inner')
print (df1)              
      C   D  E  F  G  x  y  z
A B                          
1 1   c  B1  3  3  2  3  3  2
4 1  c3  B3  3  6  2  2  7  2

Or combination with boolean indexing:

df1 = df1[((df1['A'].isin(df2.A)) & (df1['B'].isin(df2.B)) ) ]
print (df1)
   A  B   C   D  E  F  G
0  1  1   c  B1  3  3  2
1  2  1  A1  B1  3  4  2
3  4  1  c3  B3  3  6  2

df2 = df2[((df2['A'].isin(df1.A)) & (df2['B'].isin(df1.B)) ) ]
print (df2)
   A  B  x  y  z
0  1  1  3  3  2
3  4  1  2  7  2

df3 = pd.concat([df1.set_index(['A','B']),
                 df2.set_index(['A','B'])], axis=1, join='inner')
print (df3)              
      C   D  E  F  G  x  y  z
A B                          
1 1   c  B1  3  3  2  3  3  2
4 1  c3  B3  3  6  2  2  7  2

If df1 after filtering is not large, use merge:

df1 = df1[((df1['A'].isin(df2.A)) & (df1['B'].isin(df2.B)) ) ]
print (df1)
   A  B   C   D  E  F  G
0  1  1   c  B1  3  3  2
1  2  1  A1  B1  3  4  2
3  4  1  c3  B3  3  6  2

df2 = df2[((df2['A'].isin(df1.A)) & (df2['B'].isin(df1.B)) ) ]
print (df2)
   A  B  x  y  z
0  1  1  3  3  2
3  4  1  2  7  2

df3 = pd.merge(df1,df2, on=['A','B']) 
print (df3)              
   A  B   C   D  E  F  G  x  y  z
0  1  1   c  B1  3  3  2  3  3  2
1  4  1  c3  B3  3  6  2  2  7  2
like image 143
jezrael Avatar answered Sep 07 '25 17:09

jezrael


There are some multi-index problems by use isin() then pd.concat.
I solved the problem by np.array_split.

step 1: split the dataframe to 5 dataframe

dfa,dfb,dfc,dfd,dfe = np.array_split(df1,5)

step 2: merge them separately

dfa = pd.merge(dfa,df2,how='inner',on=['A','B'])
dfb = pd.merge(dfb,df2,how='inner',on=['A','B'])
dfc = pd.merge(dfc,df2,how='inner',on=['A','B'])
dfd = pd.merge(dfd,df2,how='inner',on=['A','B'])
dfe = pd.merge(dfe,df2,how='inner',on=['A','B'])

step 3: append them to one dataframe

result = dfa.append([dfb, dfc,dfd,dfe])
like image 26
user2775128 Avatar answered Sep 07 '25 17:09

user2775128