Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Merging two df's with different amount of rows

I have two dataframes that both have a column that can have the same number/value in it. One 'small df with ~300 rows (which is my leading file) and 1 df with ~ 5000 rows. I want to merge on 1 column but I cannot get the same amount of rows when I print the data.

first (small) dataframe (left):

import pandas as pd
df1 = pd.read_excel('./file.xlsx')
df1 = df.replace(' ', np.nan)
df1.head()
 
         col1       

row1     123456                         
row2     123457                        
row3     123458                             
row4     123459                            
row5     123450                            

second (big) df (right):

import pandas as pd
df2 = pd.read_excel('./file2.xlsx')
df2 = df.replace(' ', np.nan)
df2
 
         col1       col2      

row1     123456     hello1                    
row2     123457     hello2                    
row3     123458     hello3                        
row4     123459     hello4                        
row5     123450     hello4 
row7     555555     street1
row8     666666     street1 
row9     777777     street1        

I tried:

merged = pd.merge(left=df1, right=df2, how='inner', left_on='col1', right_on='col1')
print("Orginele data", len(df1))
print("Merged data", len(df2))

When I print I get like 30k rows in the left df but I only want to see the rows used in the left df (~300 rows). Most of them are NaN's. I tried changing the 'how=' but that did not work. I also checked the post "Merging 101" but can't seem to figure this out.

Expected result in left (small) dataframe:

         col1       col2      

row1     123456     hello1                    
row2     123457     hello2                    
row3     123458     hello3                        
row4     123459     hello4                        
row5     123450     hello4 

Appreciate the help and effort. Thank you!

like image 247
Deidraak Avatar asked Oct 21 '25 08:10

Deidraak


2 Answers

Try dataframe.join you can specify how='left which is by default

import pandas as pd
df = pd.DataFrame({"a": [0,0,1,1,2,2,2,]})
df2 = pd.DataFrame({"a": [0, 1,2,3,4,5,6,7,8,9], "b": list("abcdefghij")})
df.join(df2, on="a", lsuffix="df_a", rsuffix="df_b")

# output
  adf_a adf_b   b
0   0   0       a
1   0   0       a
2   1   1       b
3   1   1       b
4   2   2       c
5   2   2       c
6   2   2       c
like image 131
Deepak Tripathi Avatar answered Oct 23 '25 21:10

Deepak Tripathi


I think you need to try a left join instead of inner. You can read the documentation here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

like image 44
mtm1186 Avatar answered Oct 23 '25 23:10

mtm1186



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!