How do I efficiently hash a Polars dataframe?

I am trying to implement some caching logic on a function that acts on a Polars dataframe (this is all in Python).

To avoid needlessly re-computing the result, it'd be great if I could quickly check whether the dataframe has changed, i.e. with a hash comparison.

I am currently using:

_my_hash = df.hash_rows().sum() # int

But curious to know if there are better options.
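
For context, the kind of caching I have in mind is roughly the following (simplified sketch; expensive_transform, the cache dict, and the column name are just placeholders):

import polars as pl

_cache: dict[int, pl.DataFrame] = {}

def expensive_transform(df: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical expensive function whose result I want to memoise
    key = df.hash_rows().sum()
    if key not in _cache:
        _cache[key] = df.with_columns(pl.col("foo") * 2)  # stand-in for the real work
    return _cache[key]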

asked Feb 04 '26 by MYK

2 Answers

Here's another solution, which takes @SlyFox's answer and adapts it to use the stdlib hashlib and to take the schema into account:

import hashlib

import polars as pl


def hash_df(df: pl.DataFrame) -> str:
    hasher = hashlib.sha1()
    # Feed the schema (column names and dtypes) into the hash first
    for c, t in df.schema.items():
        hasher.update(c.encode())
        hasher.update(str(t).encode())
    # Then feed each 64-bit row hash, preserving row order
    for h in df.hash_rows():
        hasher.update(h.to_bytes(64, "big"))
    return hasher.hexdigest()

You can easily change the hashing function to whatever is available in hashlib without installing any third-party library.
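
For instance, a small variant (hash_df_with is just an illustrative name, not part of the answer above) that takes the hashlib constructor as a parameter:

import hashlib
from typing import Callable

import polars as pl


def hash_df_with(df: pl.DataFrame, algo: Callable = hashlib.blake2b) -> str:
    # Identical scheme to hash_df above; only the hashlib constructor changes
    hasher = algo()
    for c, t in df.schema.items():
        hasher.update(c.encode())
        hasher.update(str(t).encode())
    for h in df.hash_rows():
        hasher.update(h.to_bytes(64, "big"))
    return hasher.hexdigest()

# e.g. hash_df_with(df, hashlib.sha256) or hash_df_with(df, hashlib.md5)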

Using df.hash_rows().sum(), the following would all have the same hash:

  • DataFrames with the same rows but in different order
  • DataFrames with the same content but different column names
  • DataFrames with different schemas that have the same binary representation (e.g. 42 as pl.Int8 or pl.Int64)

Example:

df = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": ["a", "b", "c"]
})
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 3   ┆ c   │
└─────┴─────┘

And taking the hash of the following variants:

dfs = [
    df,
    df.reverse(),
    df.rename({"foo": "FOO"}),
    df.with_columns(pl.col("foo").cast(pl.Int8)),
]

for df in dfs:
    print(df.hash_rows().sum())

We get the same hash in all cases (using Polars v1.9.0):

10223265008601598624
10223265008601598624
10223265008601598624
10223265008601598624

Using the proposed hash_df above, we get different hashes for all the variants:

for df in dfs:
    print(hash_df(df))

7163cc715a84c3cc2297c8d41182b70855a27b4f
dffdef4f683065b2e02f08c77db7180eef24b798
70d782775a985fdf4d6872b464e31698eb496905
0677b8984f67884d4b54e2405d7892814ea9e07f

An alternative, less efficient approach that doesn't use hash_rows is simply serializing the DataFrame into Parquet and taking its hash:

import io
import hashlib

def hash_df2(df: pl.DataFrame) -> str:
    # Serialize the whole DataFrame to an in-memory Parquet buffer and hash the raw bytes
    with io.BytesIO() as buf:
        df.write_parquet(buf)
        buf.seek(0)
        return hashlib.sha1(buf.read()).hexdigest()

From some simple benchmarking, hash_df2 is about 3-4x slower than hash_df, so definitely not the most efficient approach.
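
If you want to reproduce that comparison, a rough sketch could look like this (big_df is a made-up test frame; absolute timings will depend on your machine, data shape, and Polars version):

import timeit

import polars as pl

# Hypothetical test frame, large enough for the timing difference to be visible
big_df = pl.DataFrame({
    "foo": list(range(1_000_000)),
    "bar": ["x", "y"] * 500_000,
})

print("hash_df: ", timeit.timeit(lambda: hash_df(big_df), number=10))
print("hash_df2:", timeit.timeit(lambda: hash_df2(big_df), number=10))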

answered Feb 05 '26 by Gustavo Bezerra

It depends on what you need from a hash, but in my case I needed a hash to enable caching of jobs. This function worked well for that:

import mmh3
import polars as pl


def hash_dataframe(df: pl.DataFrame, seed=42) -> str:
    """Hash a polars DataFrame. Due to the behaviour of pl.DataFrame.hash_rows 
    this will only be consistent given a polars version.
    Args:
        df (pl.DataFrame): polars DataFrame to be hashed.
        seed (int, optional): Seed for the hash function.
    Returns:
        str: Hash of the polars DataFrame.
    """
    row_hashes = df.hash_rows(seed=seed)
    hasher = mmh3.mmh3_x64_128(seed=seed)
    for row_hash in row_hashes:
        hasher.update(row_hash.to_bytes(64, "little"))
    return hasher.digest().hex()

answered Feb 05 '26 by SlyFox