
pandas combine_first with particular index columns?

Tags: python, pandas

I'm trying to join two dataframes in pandas with the following behavior: I want to join on a specified column, but without adding redundant columns to the result. This is analogous to combine_first, except that combine_first does not seem to accept an optional argument naming the index column. Example:

# combine df1 and df2 based on "id" column
df1 = pandas.merge(df1, df2, how="outer", on=["id"])

The problem with the above is that columns common to df1 and df2, aside from "id", are added twice to the result (with _x and _y suffixes). How can I do something like:

# Do outer join from df2 to df1, matching items by "id" but not adding
# columns that are redundant (df1 takes precedence if the values disagree)
df1.combine_first(df2, on=["id"])

How can this be done?


1 Answer

If you are trying to merge columns from df2 into df1 while excluding any redundant columns, the following should work.

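# Index both frames by "id" so the merge aligns rows on it
df1.set_index("id", inplace=True)
df2.set_index("id", inplace=True)
# Bring in only the df2 columns that df1 does not already have
df3 = df1.merge(df2[df2.columns.difference(df1.columns)], left_index=True, right_index=True, how="outer")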

Note that this will not update any values in df1 with values from df2, since it only brings in the non-redundant columns. Rows that exist only in df2 will therefore have NaN in the columns shared with df1. But since you said df1 takes precedence wherever the values disagree, perhaps this will do the trick?
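If gaps in df1 should also be filled from df2, another option may be to call combine_first after moving "id" into the index, since combine_first aligns on the index rather than on a named column. The following is only a minimal sketch: the sample frames and the column names "a", "b" and "c" are made up for illustration and are not from the question. Be aware that combine_first keeps df1's values only where they are not NaN; a missing value in df1 is replaced by the df2 value for the same id.

```python
import pandas as pd

# Hypothetical sample data; these frames and column names are illustrative only.
df1 = pd.DataFrame({"id": [1, 2], "a": [10.0, None], "b": ["x", "y"]})
df2 = pd.DataFrame({"id": [2, 3], "a": [20.0, 30.0], "c": [True, False]})

# Index both frames by "id" so that combine_first aligns rows on it.
left = df1.set_index("id")
right = df2.set_index("id")

# Non-null values from df1 win; df2 fills the gaps and contributes
# the rows and columns that df1 does not have.
combined = left.combine_first(right).reset_index()
print(combined)
```

The result has one row per id found in either frame and a single copy of each column, so no _x/_y duplicates appear.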

bdiamante answered Dec 01 '25 02:12