Set differences on columns between dataframes

Question

Note: This question is inspired by the ideas discussed in this other post: DataFrame algebra in Pandas

Say I have two dataframes A and B and that for some column col_name, their values are:

A[col_name]   |  B[col_name]  
--------------| ------------
1             |  3
2             |  4
3             |  5
4             |  6

I want to compute the set difference between A and B based on col_name. The result of this operation should be:

The rows of A where A[col_name] didn't match any entries in B[col_name].

Below is the result for the above example (showing other columns of A as well):

A[col_name] | A[other_column_1] | A[other_column_2]
------------|-------------------|------------------
1           | 'foo'             | 'xyz'             ....
2           | 'bar'             | 'abc'

Keep in mind that some entries in A[col_name] and B[col_name] could hold the value np.nan. I would like to treat those entries as undefined BUT different, i.e. a NaN should never match another NaN, so the set difference should return those rows.

How can I do this in Pandas? (generalizing to a difference on multiple columns would be great as well)

Andy Hayden · Accepted Answer

One way is to use the Series isin method (the sessions below assume pandas is imported as pd and numpy as np):

In [11]: df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns = ['A', 'B'])

In [12]: df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns = ['A', 'C'])

Now you can check whether each item in df1['A'] is in df2['A']:

In [13]: df1['A'].isin(df2['A'])
Out[13]:
0    False
1    False
2     True
3     True
Name: A, dtype: bool

In [14]: df1[~df1['A'].isin(df2['A'])]  # not in df2['A']
Out[14]:
   A    B
0  1  foo
1  2  bar

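I think this does what you want for NaNs too (the output below is from the pandas version I ran it on; newer versions of isin can treat NaN as matching NaN, which would drop rows 1 and 3, so check the result on your install):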

In [21]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns = ['A', 'B'])

In [22]: df2 = pd.DataFrame([[3], [np.nan]], columns = ['A'])

In [23]: df1[~df1['A'].isin(df2['A'])]
Out[23]:
    A     B
0 1.0   foo
1 NaN   bar
3 NaN   baz
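
If you'd rather not depend on how isin compares NaN with NaN, the mask can be made explicit. The sketch below is one way to do that, not necessarily what the answer above intended: it rebuilds df1 and df2 from In [21] and In [22], and `matched` is just an illustrative name. A row counts as matched only if its value is in df2['A'] and is not NaN, so NaN rows always survive the difference.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns=['A', 'B'])
df2 = pd.DataFrame([[3], [np.nan]], columns=['A'])

# A row is "matched" only when its value appears in df2['A'] AND is not NaN.
matched = df1['A'].isin(df2['A']) & df1['A'].notna()

# Keep everything that is not matched; NaN rows are kept by construction.
result = df1[~matched]
print(result)
```

This should keep rows 0, 1 and 3 whichever way isin handles NaN.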

Note: For large frames it may be worth making these columns an index (to perform the join as discussed in the other question).
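
As a rough illustration of the index idea, here is a minimal sketch using the NaN-free frames from In [11] and In [12]. The names `left`, `right` and `only_in_left` are made up for the example, and NaN handling is not addressed here. Whether this beats isin on your data is something to benchmark rather than assume.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns=['A', 'C'])

# Move the key column into the index of each frame.
left = df1.set_index('A')
right = df2.set_index('A')

# Labels present in left's index but absent from right's index.
only_in_left = left.index.difference(right.index)

# Select those rows and turn the index back into a regular column.
result = left.loc[only_in_left].reset_index()
print(result)
```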

More generally

One way to merge on two or more columns is to use a dummy column:

In [31]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns = ['A', 'B'])

In [32]: df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns = ['A', 'B'])

In [33]: cols = ['A', 'B']

In [34]: df2['dummy'] = df2[cols].isnull().any(axis=1)  # rows with NaNs in cols will be True

In [35]: merged = df1.merge(df2[cols + ['dummy']], how='left')

In [36]: merged
Out[36]:
    A      B  dummy
0   1    foo    NaN
1 NaN    bar   True
2   4    meh  False
3 NaN  eurgh    NaN

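The booleans come from df2: True means the matching df2 row has a NaN in one of the merging columns, False means the row matched a df2 row with no NaNs in those columns, and NaN means there was no match at all. Following your spec, we should drop the rows which are False: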

In [37]: merged.loc[merged.dummy != False, df1.columns]
Out[37]:
    A      B
0   1    foo
1 NaN    bar
3 NaN  eurgh

Inelegant.
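
A possibly tidier variant, offered as a sketch rather than as the method used above, relies on the `indicator=True` option of merge, which adds a `_merge` column saying whether each row was found in both frames or only in the left one. The helper name `anti_join` is hypothetical. Because merge matches NaN keys to each other, rows with a NaN in any key column are added back explicitly to follow the "NaNs are different" spec.

```python
import numpy as np
import pandas as pd

def anti_join(left, right, cols):
    """Rows of `left` whose values in `cols` match no row of `right`.

    Rows with a NaN in any of `cols` are always kept.
    (Hypothetical helper, not part of pandas.)
    """
    # De-duplicate the keys so each left row matches at most one right row,
    # which keeps the merged frame aligned row-for-row with `left`.
    keys = right[cols].drop_duplicates()
    merged = left.merge(keys, on=cols, how='left', indicator=True)
    unmatched = (merged['_merge'] == 'left_only').to_numpy()
    has_nan = left[cols].isna().any(axis=1).to_numpy()
    return left[unmatched | has_nan]

df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns=['A', 'B'])
df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns=['A', 'B'])

result = anti_join(df1, df2, ['A', 'B'])
print(result)
```

On the frames from In [31] and In [32] this should keep the same rows as Out[37], without adding a column to df2.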
