Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare DataFrames for Different Rows regardless of row order

I have been trying to compare two dataframes to find missing rows and different rows:

Case 1: Same number of rows and different rows:

In this case, I have same number of rows, but two different rows:

dict_a = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grapes"]}
dict_b = {'Values 1':[15, 3, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

enter image description here

So, I can find the different row indexes using:

list(df1[~df1.isin(df2)].dropna(how = 'all').index)

This results in [1, 5].

Case 2: Different number of rows and different rows:

In this case, I have different number of rows, and two different rows.

dict_a = {'Values 1':[15, 2, 3, 24, 1], 'Values 2':[10, 7, 3, 5, 6], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana"]}
dict_b = {'Values 1':[15, 1, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

display(df1)
display(df2)

enter image description here

Here, first I check for missing rows. From this, I can find that it is the fifth index by comparing the dataframe with greater length with shorter length:

df2[~df2.index.isin(df1.index)]

I can also find the different rows:

df1[~df1.isin(df2)].dropna(how = 'all')

which are [1, 4].

Case 3: Interchanged rows, but same row data

dict_a = {'Values 1':[2, 15, 3, 24, 5, 16], 'Values 2':[7, 10, 3, 5, 6, 23], 
          'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

display(df1)
display(df2)

However, the code doesn't work for interchanged rows. It needs to check if row already exists regardless of position and return a result "there are no different rows", but instead it says rows at index [0, 1] are different.

enter image description here

So, I am looking for a solution which can compare the two dataframes regardless of row order or position, missing rows, and different rows.

So, my goal is to compare two dataframes and return "missing rows" and also "different rows" i.e. which don't completely match each other. Is there a such a function which can actually compare the rows in this way?

like image 889
cpx Avatar asked Oct 21 '25 20:10

cpx


1 Answers

Here is a great trick that I learned by accident pd.merge does a great job showing this

dict_a = {'Values 1':[2, 15, 3, 24, 5, 16], 'Values 2':[7, 10, 3, 5, 6, 23], 
          'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

df3 = pd.merge(df1,df2,how='outer',on=['Values 1','Values 2','Values 3'],indicator=True)
print(df3)

is your last case you can see that this solves the problem with order

   Values 1  Values 2 Values 3 _merge
0         2         7   Orange   both
1        15        10    Apple   both
2         3         3     Kiwi   both
3        24         5   Cherry   both
4         5         6   Banana   both
5        16        23    Grape   both

Now go back to an earlier case

dict_c = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Nut"]}
df4 = pd.DataFrame(dict_c)
df5 = pd.merge(df1,df4,how='outer', on=['Values 1','Values 2','Values 3'],indicator=True)
print(df5)

You get

   Values 1  Values 2 Values 3      _merge
0         2         7   Orange        both
1        15        10    Apple        both
2         3         3     Kiwi        both
3        24         5   Cherry        both
4         5         6   Banana        both
5        16        23    Grape   left_only
6        16        23      Nut  right_only

The indicators give you where they are different. Using the on variable also makes it so that you can use a subset of the columns to fix your problem.

like image 183
Paul Brennan Avatar answered Oct 23 '25 10:10

Paul Brennan



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!