I have two dataframe
>>df1.info()
>><class 'pandas.core.frame.DataFrame'>
Int64Index: 2598374 entries, 3975 to 3054366
Data columns (total 14 columns): ......
>>df2.info()
>><class 'pandas.core.frame.DataFrame'>
Int64Index: 2520405 entries, 2066 to 2519507
Data columns (total 5 columns): ......
I wanna inner join them. I tried pd.merge
and I got memory error. Thus, I tried to do same things without pd.merge
.
Example dataframe for original method (failed: memory error)
df1 = pd.DataFrame({'A': ['1', '2', '3', '4','5'],
'B': ['1', '1', '1', '1','1'],
'C': ['c', 'A1', 'a', 'c3','a'],
'D': ['B1', 'B1', 'B2', 'B3','B4'],
'E': ['3', '3', '3', '3','3'],
'F': ['3', '4', '5', '6','7'],
'G': ['2', '2', '2', '2','2']})
df2 = pd.DataFrame({'A': ['1', '2', '8','4'],
'B': ['1', '2', '5','1'],
'x': ['3', '3', '2','2'],
'y': ['3', '4', '6','7'],
'z': ['2', '2', '2','2']})
>> df1
A B C D E F G
0 1 1 c B1 3 3 2
1 2 1 A1 B1 3 4 2
2 3 1 a B2 3 5 2
3 4 1 c3 B3 3 6 2
4 5 1 a B4 3 7 2
df2
A B x y z
0 1 1 3 3 2
1 2 2 3 4 2
2 8 5 2 6 2
3 4 1 2 7 2
df1 = pd.merge(df1,df2,how='inner',on=['A','B'])
>> df1
A B C D E F G x y z
0 1 1 c B1 3 3 2 3 3 2
1 4 1 c3 B3 3 6 2 2 7 2
Example for new method
(1) I tried to delete rows in df1 which are not in df2 by column['A']['B'].
(2) concat x,y,z columns to df1
df1 = pd.DataFrame({'A': ['1', '2', '3', '4','5'],
'B': ['1', '1', '1', '1','1'],
'C': ['c', 'A1', 'a', 'c3','a'],
'D': ['B1', 'B1', 'B2', 'B3','B4'],
'E': ['3', '3', '3', '3','3'],
'F': ['3', '4', '5', '6','7'],
'G': ['2', '2', '2', '2','2']})
df2 = pd.DataFrame({'A': ['1', '2', '8','4'],
'B': ['1', '2', '5','1'],
'x': ['3', '3', '2','2'],
'y': ['3', '4', '6','7'],
'z': ['2', '2', '2','2']})
>> df1
A B C D E F G
0 1 1 c B1 3 3 2
1 2 1 A1 B1 3 4 2
2 3 1 a B2 3 5 2
3 4 1 c3 B3 3 6 2
4 5 1 a B4 3 7 2
df2
A B x y z
0 1 1 3 3 2
1 2 2 3 4 2
2 8 5 2 6 2
3 4 1 2 7 2
df1 = df1.loc[((df1['A'].isin(df2.A)) & (df1['B'].isin(df2.B)) ) ]
>> df1
A B C D E F G
0 1 1 c B1 3 3 2
1 2 1 A1 B1 3 4 2
3 4 1 c3 B3 3 6 2
however, I got a logical error and I have no idea to solve this problem. Can anyone help?
You can try concat
with set_index
:
df1 = pd.concat([df1.set_index(['A','B']),
df2.set_index(['A','B'])], axis=1, join='inner')
print (df1)
C D E F G x y z
A B
1 1 c B1 3 3 2 3 3 2
4 1 c3 B3 3 6 2 2 7 2
Or combination with boolean indexing
:
df1 = df1[((df1['A'].isin(df2.A)) & (df1['B'].isin(df2.B)) ) ]
print (df1)
A B C D E F G
0 1 1 c B1 3 3 2
1 2 1 A1 B1 3 4 2
3 4 1 c3 B3 3 6 2
df2 = df2[((df2['A'].isin(df1.A)) & (df2['B'].isin(df1.B)) ) ]
print (df2)
A B x y z
0 1 1 3 3 2
3 4 1 2 7 2
df3 = pd.concat([df1.set_index(['A','B']),
df2.set_index(['A','B'])], axis=1, join='inner')
print (df3)
C D E F G x y z
A B
1 1 c B1 3 3 2 3 3 2
4 1 c3 B3 3 6 2 2 7 2
If df1
after filtering is not large, use merge
:
df1 = df1[((df1['A'].isin(df2.A)) & (df1['B'].isin(df2.B)) ) ]
print (df1)
A B C D E F G
0 1 1 c B1 3 3 2
1 2 1 A1 B1 3 4 2
3 4 1 c3 B3 3 6 2
df2 = df2[((df2['A'].isin(df1.A)) & (df2['B'].isin(df1.B)) ) ]
print (df2)
A B x y z
0 1 1 3 3 2
3 4 1 2 7 2
df3 = pd.merge(df1,df2, on=['A','B'])
print (df3)
A B C D E F G x y z
0 1 1 c B1 3 3 2 3 3 2
1 4 1 c3 B3 3 6 2 2 7 2
There are some multi-index problems by use isin()
then pd.concat
.
I solved the problem by np.array_split
.
step 1: split the dataframe to 5 dataframe
dfa,dfb,dfc,dfd,dfe = np.array_split(df1,5)
step 2: merge them separately
dfa = pd.merge(dfa,df2,how='inner',on=['A','B'])
dfb = pd.merge(dfb,df2,how='inner',on=['A','B'])
dfc = pd.merge(dfc,df2,how='inner',on=['A','B'])
dfd = pd.merge(dfd,df2,how='inner',on=['A','B'])
dfe = pd.merge(dfe,df2,how='inner',on=['A','B'])
step 3: append them to one dataframe
result = dfa.append([dfb, dfc,dfd,dfe])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With