I have one frame that looks like this (df):
                                 2000q1         2000q2         2000q3  
State        RegionName                                                  
New York     New York                NaN            NaN            NaN   
California   Los Angeles   207066.666667  214466.666667  220966.666667   
Illinois     Chicago       138400.000000  143633.333333  147866.666667
(notice that State,RegionName here is a MultiIndex)
and one frame that looks like this (ut):
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
So to get all rows where State,RegionName are in both dataframes, I do this:
dfut = pd.merge(df, ut, how='inner', left_index=True, right_on=['State', 'RegionName'])
That works. I now want a list of rows where rows from df frame are NOT in ut frame -- like a "NOT inner join". I am pretty sure that I need to do a LEFT join which will give me the entire df, but I am not sure how to subtract ut intersecting rows out of it. Hope its clear. Thank you
include the parameter indicator=True in your merge and query('_merge != "both"')
dfut = pd.merge(df, ut, how='outer',
                left_index=True, right_on=['State', 'RegionName'],
                indicator=True)
dfut.query('_merge != "both"')

There is a much simpler way without query
You want to do the inner merge of df and ut to get all the
'matching' indexes of df (i.e., indexes of df that are in ut)
inner_merge_index = pd.merge(df.reset_index(), ut, how='inner').set_index(['State', 'RegionName']).index
Then, you want to remove all these matched indexes from df
dfut = df.drop(list(inner_merge_index))
Note this also works with MultiIndex. To test it, you may want to expand your test df (because it has 0 intersection with ut) by adding a row that is actually preserved when doing the inner merge (added 'Alabama'/'Jacksonville' below). If you test it, you will see this is the only row that gets dropped.
df = pd.DataFrame([['New York', 'New York', 'NaN', 'NaN', 'NaN'],
                   ['California', 'Los Angeles', '207066.666667', '214466.666667', '220966.666667'],
                   ['Illinois', 'Chicago', '138400.000000', '143633.333333', '147866.666667'],
                   ['Alabama', 'Jacksonville', '98765.000000', '54321.333333', '12345.0']],
                  columns=['State', 'RegionName', '2000q1', '2000q2', '2000q3']).set_index(['State', 'RegionName'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With