I am trying to implement some caching logic for a function that acts on a Polars DataFrame (this is all in Python).
To avoid needlessly re-computing the result, it'd be great if I could quickly check whether the DataFrame has changed, i.e. via a hash comparison.
I am currently using:
_my_hash = df.hash_rows().sum() # int
But I'm curious to know if there are better options.
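For context, here's roughly the caching pattern I have in mind (the cache and the expensive_transform function are just placeholders):

import polars as pl

_cache: dict[int, pl.DataFrame] = {}

def expensive_transform(df: pl.DataFrame) -> pl.DataFrame:
    # stand-in for the real, costly computation
    return df.select(pl.all().reverse())

def cached_transform(df: pl.DataFrame) -> pl.DataFrame:
    # hash_rows() gives one UInt64 per row; summing collapses it to a single int
    key = df.hash_rows().sum()
    if key not in _cache:
        _cache[key] = expensive_transform(df)
    return _cache[key]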
Here's another solution, which takes @SlyFox's answer and adapts it to use the stdlib hashlib and to take the schema into account:
import hashlib
import polars as pl

def hash_df(df: pl.DataFrame) -> str:
    hasher = hashlib.sha1()
    # include the schema (column names and dtypes) in the hash
    for c, t in df.schema.items():
        hasher.update(c.encode())
        hasher.update(str(t).encode())
    # hash_rows() yields one UInt64 per row; feed each into the hasher
    for h in df.hash_rows():
        hasher.update(h.to_bytes(64, "big"))
    return hasher.hexdigest()
You can easily change the hashing function to whatever is available in hashlib without installing any third-party library.
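For example, the algorithm can be selected by name via hashlib.new (hash_df_with is just an illustrative variant of the function above):

import hashlib
import polars as pl

def hash_df_with(df: pl.DataFrame, algorithm: str = "sha256") -> str:
    # same scheme as hash_df above, but with the hashlib algorithm chosen by name
    hasher = hashlib.new(algorithm)
    for c, t in df.schema.items():
        hasher.update(c.encode())
        hasher.update(str(t).encode())
    for h in df.hash_rows():
        hasher.update(h.to_bytes(64, "big"))
    return hasher.hexdigest()

Any name listed in hashlib.algorithms_available works, e.g. hash_df_with(df, "sha256") or hash_df_with(df, "blake2b").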
Using df.hash_rows().sum(), all of the following would produce the same hash:

- the same rows in a different order
- the same data with different column names
- the same values stored with a different dtype (e.g. 42 as pl.Int8 or pl.Int64)

Example:
df = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": ["a", "b", "c"],
})
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
│ 2 ┆ b │
│ 3 ┆ c │
└─────┴─────┘
And taking the hash of the following variants:
dfs = [
    df,
    df.reverse(),
    df.rename({"foo": "FOO"}),
    df.with_columns(pl.col("foo").cast(pl.Int8)),
]
for df in dfs:
    print(df.hash_rows().sum())
We get the same hash in all cases (using Polars v1.9.0):
10223265008601598624
10223265008601598624
10223265008601598624
10223265008601598624
Using the proposed hash_df above, we get a different hash for each variant:
for df in dfs:
    print(hash_df(df))
7163cc715a84c3cc2297c8d41182b70855a27b4f
dffdef4f683065b2e02f08c77db7180eef24b798
70d782775a985fdf4d6872b464e31698eb496905
0677b8984f67884d4b54e2405d7892814ea9e07f
An alternative, less efficient approach that doesn't use hash_rows is simply serializing the DataFrame into Parquet and taking its hash:
import io
import hashlib
import polars as pl

def hash_df2(df: pl.DataFrame) -> str:
    with io.BytesIO() as buf:
        # serialize the whole frame (data + schema) to Parquet in memory
        df.write_parquet(buf)
        buf.seek(0)
        return hashlib.sha1(buf.read()).hexdigest()
From some simple benchmarking, hash_df2 is about 3-4x slower than hash_df, so definitely not the most efficient approach.
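For reference, the comparison can be reproduced roughly like this with timeit (the sample frame is arbitrary, and the exact ratio will vary with size and dtypes):

import timeit
import polars as pl

sample = pl.DataFrame({
    "foo": range(100_000),
    "bar": [f"row_{i}" for i in range(100_000)],
})

for fn in (hash_df, hash_df2):
    elapsed = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {elapsed / 20:.4f}s per call")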
It depends on what you need from a hash, but in my case I needed a hash to enable caching of jobs. This function worked well for that:
import mmh3
import polars as pl

def hash_dataframe(df: pl.DataFrame, seed=42) -> str:
    """Hash a polars DataFrame. Due to the behaviour of pl.DataFrame.hash_rows
    this will only be consistent given a polars version.

    Args:
        df (pl.DataFrame): polars DataFrame to be hashed.
        seed (int, optional): Seed for the hash function.

    Returns:
        str: Hash of the polars DataFrame.
    """
    row_hashes = df.hash_rows(seed=seed)
    hasher = mmh3.mmh3_x64_128(seed=seed)
    for row_hash in row_hashes:
        hasher.update(row_hash.to_bytes(64, "little"))
    return hasher.digest().hex()
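Usage might look like this (the results cache here is only for illustration):

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})

key = hash_dataframe(df)
print(key)  # 32-character hex digest of the 128-bit murmur3 hash

results_cache: dict[str, pl.DataFrame] = {}
if key not in results_cache:
    results_cache[key] = df  # stand-in for the actual job result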