Note: This question is inspired by the ideas discussed in this other post: DataFrame algebra in Pandas
Say I have two DataFrames, A and B, and for some column col_name their values are:
A[col_name] | B[col_name]
--------------| ------------
1 | 3
2 | 4
3 | 5
4 | 6
I want to compute the set difference between A and B based on col_name. The result of this operation should be:
The rows of A where A[col_name] didn't match any entries in B[col_name].
Below is the result for the above example (showing other columns of A as well):
A[col_name] | A[other_column_1] | A[other_column_2]
------------|-------------------|------------------
1           | 'foo'             | 'xyz' ....
2           | 'bar'             | 'abc'
Keep in mind that some entries in A[col_name] and B[col_name] could hold the value np.nan. I would like to treat those entries as undefined but distinct from one another, i.e. a NaN in A never matches a NaN in B, so the set difference should return those rows.
How can I do this in Pandas? (generalizing to a difference on multiple columns would be great as well)
One way is to use the Series isin method:
In [11]: df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns = ['A', 'B'])
In [12]: df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns = ['A', 'C'])
Now you can check whether each item in df1['A'] is in df2['A']:
In [13]: df1['A'].isin(df2['A'])
Out[13]:
0 False
1 False
2 True
3 True
Name: A, dtype: bool
In [14]: df1[~df1['A'].isin(df2['A'])] # not in df2['A']
Out[14]:
A B
0 1 foo
1 2 bar
I think this does what you want for NaNs too:
In [21]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns = ['A', 'B'])
In [22]: df2 = pd.DataFrame([[3], [np.nan]], columns = ['A'])
In [23]: df1[~df1['A'].isin(df2['A'])]
Out[23]:
A B
0 1.0 foo
1 NaN bar
3 NaN baz
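How `isin` treats NaN may depend on the pandas version (newer releases can treat NaN as matching NaN), so a more defensive variant drops NaNs from the comparison values first. That way a NaN in df1 can never count as a match. This is a minimal sketch on the same toy frames as above; the helper name `set_difference` is made up for illustration:

```python
import numpy as np
import pandas as pd


def set_difference(left, right, col):
    """Rows of `left` whose `col` value does not appear in `right[col]`.

    NaNs are dropped from the comparison values, so NaN never matches NaN
    and rows of `left` with NaN in `col` are always kept.
    """
    values = right[col].dropna()
    return left[~left[col].isin(values)]


df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[3], [np.nan]], columns=['A'])

result = set_difference(df1, df2, 'A')  # keeps the rows at index 0, 1 and 3
```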
Note: For large frames it may be worth making these columns an index (to perform the join as discussed in the other question).
One way to merge on two or more columns is to use a dummy column:
In [31]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns = ['A', 'B'])
In [32]: df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns = ['A', 'B'])
In [33]: cols = ['A', 'B']
In [34]: df2['dummy'] = df2[cols].isnull().any(axis=1) # rows with NaNs in cols will be True
In [35]: merged = df1.merge(df2[cols + ['dummy']], how='left')
In [36]: merged
Out[36]:
A B dummy
0 1 foo NaN
1 NaN bar True
2 4 meh False
3 NaN eurgh NaN
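The booleans come from df2: True marks a matched row that has a NaN in one of the merging columns, False marks a matched row without NaNs, and NaN means the row found no match in df2 at all. Following your spec, we should drop the rows where dummy is False: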
In [37]: merged.loc[merged.dummy != False, df1.columns]
Out[37]:
A B
0 1 foo
1 NaN bar
3 NaN eurgh
Inelegant.
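An alternative sketch for the multi-column case uses the `indicator` argument of `merge` instead of a hand-made dummy column. It assumes, per the question's spec, that rows of df2 with a NaN in any key column should never match, so they are dropped before merging. The helper name `set_difference_multi` is hypothetical:

```python
import numpy as np
import pandas as pd


def set_difference_multi(left, right, cols):
    """Rows of `left` that have no match in `right` on all of `cols`."""
    # Drop NaN keys (they must never match) and duplicate keys (so the
    # left merge returns exactly one row per row of `left`).
    keys = right[cols].dropna().drop_duplicates()
    merged = left.merge(keys, on=cols, how='left', indicator=True)
    # '_merge' is 'left_only' for rows of `left` that found no partner.
    keep = (merged['_merge'] == 'left_only').to_numpy()
    return left[keep]


df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns=['A', 'B'])

result = set_difference_multi(df1, df2, ['A', 'B'])  # rows 0, 1 and 3 remain
```

A left merge keeps the row order of `left`, which is why the boolean mask built from `merged` can be applied positionally to `left`.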