Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between 2 Polars dataframes

What is the best way to find the differences between 2 Polars dataframes? The equals method tells me if there is a difference, I want to find where is the difference.

Example:

import polars as pl

df1 = pl.DataFrame([
    {'id': 1,'col1': ['a',None],'col2': ['x']},
    {'id': 2,'col1': ['b'],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘


df2 = pl.DataFrame([
    {'id': 1,'col1': ['a'],'col2': ['x']},
    {'id': 2,'col1': ['b', None],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘

The difference in the example is for id = 1 and id = 2.

I can join the dataframes:

df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        ┆ col1_df2    ┆ col2_df2    │
│ --- ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘

Expected result

I would like to either:

  • add a boolean columns that shows True in the rows with a difference
  • filter and only display rows with a difference.

The example has only 2 columns, but there are more columns in the dataframe.

like image 211
Luca Avatar asked Jan 27 '26 09:01

Luca


1 Answers

Here's the filter approach

(
    df1.join(df2, on='id', suffix='_df2') 
    .filter(pl.any_horizontal(
        pl.col(x).ne_missing(pl.col(f"{x}_df2"))
            for x in df1.columns if x!='id'
        ))
)

If you wanted the bool column then you just change the filter to with_columns and add an alias.

(
    df1.join(df2, on='id', suffix='_df2')
    .with_columns(
        has_diff = pl.any_horizontal(
            pl.col(x).ne_missing(pl.col(f"{x}_df2"))
            for x in df1.columns if x!='id'
            )
        )
)

This assumes that each df has all the same columns other than 'id'.

like image 106
Dean MacGregor Avatar answered Feb 01 '26 22:02

Dean MacGregor



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!