With python polars, how can I compare two dataframes (like with ==), and get a comparison result per-cell, but while returning True/False on comparisons with null. By default, doing df1 == df2 results in null being in any cells where either df1 or df2 contains a null.
For example:
df1 = pl.DataFrame(
{
"a": [1, 2, 3, None, 5],
"b": [5, 4, 3, 2, None],
}
)
df2 = pl.DataFrame(
{
"a": [1, 2, 3, 1, 5],
"b": [5, 4, 30, 2, None],
}
)
print(f"df1: {df1}")
print(f"df2: {df2}")
print(f"df1 == df2: {df1 == df2}")
Results in:
df1: shape: (5, 2)
┌──────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 5 │
│ 2 ┆ 4 │
│ 3 ┆ 3 │
│ null ┆ 2 │
│ 5 ┆ null │
└──────┴──────┘
df2: shape: (5, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 5 │
│ 2 ┆ 4 │
│ 3 ┆ 30 │
│ 1 ┆ 2 │
│ 5 ┆ null │
└─────┴──────┘
df1 == df2: shape: (5, 2)
┌──────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│ null ┆ true │
│ true ┆ null │
└──────┴───────┘
However, I'm trying to determine how to get the following result:
df1 compared to df2: shape: (5, 2)
┌──────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│false ┆ true │ <- false b/c cell is null in one DF, and a value in the other
│ true ┆ true │ <- bottom-right cell is true
└──────┴───────┘ because df1 and df2 have the same value (null)
It's a little more verbose that df1 == df2, but to have pure polars solution you can use eq_missing(). In addition, you can use the fact that df[col] returns column as pl.Series:
df1.select(pl.col(c).eq_missing(df2[c]) for c in df1.columns)
┌───────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞═══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│ false ┆ true │
│ true ┆ true │
└───────┴───────┘
You can easily check how polars implements pl.DataFrame.__eq__ under the hood. A helper function similar to the polars implementation, but relying on pl.Expr.eq_missing instead (as already mentioned in the answer above), could look as follows.
def _compare_to_other_df_missing(df, other):
"""
Compare a DataFrame with another DataFrame respecting `None == None`.
This differs from default comparison where null values are propagated.
"""
if df.columns != other.columns:
msg = "DataFrame columns do not match"
raise ValueError(msg)
if df.shape != other.shape:
msg = "DataFrame dimensions do not match"
raise ValueError(msg)
suffix = "__POLARS_CMP_OTHER"
other_renamed = other.select(pl.all().name.suffix(suffix))
combined = pl.concat([df, other_renamed], how="horizontal")
expr = [pl.col(n).eq_missing(pl.col(f"{n}{suffix}")) for n in df.columns]
return combined.select(expr)
>>> _compare_to_other_df_missing(df1, df2)
shape: (5, 2)
┌───────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞═══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│ false ┆ true │
│ true ┆ true │
└───────┴───────┘
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With