I am trying to implement some caching logic for a function that acts on a Polars DataFrame (this is all in Python).
To avoid needlessly re-computing the result, it'd be great if I could quickly check whether the DataFrame has changed, i.e. via a hash comparison.
I am currently using:
_my_hash = df.hash_rows().sum() # int
But I'm curious to know if there are better options.
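For context, here's roughly the caching pattern I have in mind (the cache and the expensive_transform function are just placeholders):

import polars as pl

_cache: dict[int, pl.DataFrame] = {}

def expensive_transform(df: pl.DataFrame) -> pl.DataFrame:
    # stand-in for the real, costly computation
    return df.select(pl.all().reverse())

def cached_transform(df: pl.DataFrame) -> pl.DataFrame:
    # hash_rows() gives one UInt64 per row; summing collapses it to a single int
    key = df.hash_rows().sum()
    if key not in _cache:
        _cache[key] = expensive_transform(df)
    return _cache[key]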
Here's another solution, which takes @SlyFox's answer and adapts it to use the stdlib hashlib and to take the schema into account:
import hashlib
import polars as pl

def hash_df(df: pl.DataFrame) -> str:
    hasher = hashlib.sha1()
    # include the schema (column names and dtypes) in the hash
    for c, t in df.schema.items():
        hasher.update(c.encode())
        hasher.update(str(t).encode())
    # hash_rows() yields one UInt64 per row; feed each into the hasher
    for h in df.hash_rows():
        hasher.update(h.to_bytes(64, "big"))
    return hasher.hexdigest()
You can easily change the hashing function to whatever is available in hashlib without installing any third-party library.
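For example, the algorithm can be selected by name via hashlib.new (hash_df_with is just an illustrative variant of the function above):

import hashlib
import polars as pl

def hash_df_with(df: pl.DataFrame, algorithm: str = "sha256") -> str:
    # same scheme as hash_df above, but with the hashlib algorithm chosen by name
    hasher = hashlib.new(algorithm)
    for c, t in df.schema.items():
        hasher.update(c.encode())
        hasher.update(str(t).encode())
    for h in df.hash_rows():
        hasher.update(h.to_bytes(64, "big"))
    return hasher.hexdigest()

Any name listed in hashlib.algorithms_available works, e.g. hash_df_with(df, "sha256") or hash_df_with(df, "blake2b").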
Using df.hash_rows().sum(), all of the following would produce the same hash:

- the same rows in a different order
- the same data with different column names
- the same values stored with a different dtype (e.g. 42 as pl.Int8 or pl.Int64)

Example:
df = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": ["a", "b", "c"],
})
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
│ 2 ┆ b │
│ 3 ┆ c │
└─────┴─────┘
And taking the hash of the following variants:
dfs = [
    df,
    df.reverse(),
    df.rename({"foo": "FOO"}),
    df.with_columns(pl.col("foo").cast(pl.Int8)),
]
for df in dfs:
    print(df.hash_rows().sum())
We get the same hash in all cases (using Polars v1.9.0):
10223265008601598624
10223265008601598624
10223265008601598624
10223265008601598624
Using the proposed hash_df above, we get a different hash for each variant:
for df in dfs:
    print(hash_df(df))
7163cc715a84c3cc2297c8d41182b70855a27b4f
dffdef4f683065b2e02f08c77db7180eef24b798
70d782775a985fdf4d6872b464e31698eb496905
0677b8984f67884d4b54e2405d7892814ea9e07f
An alternative, less efficient approach that doesn't use hash_rows is simply serializing the DataFrame into Parquet and taking its hash:
import io
import hashlib
import polars as pl

def hash_df2(df: pl.DataFrame) -> str:
    with io.BytesIO() as buf:
        # serialize the whole frame (data + schema) to Parquet in memory
        df.write_parquet(buf)
        buf.seek(0)
        return hashlib.sha1(buf.read()).hexdigest()
From some simple benchmarking, hash_df2 is about 3-4x slower than hash_df, so definitely not the most efficient approach.
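For reference, the comparison can be reproduced roughly like this with timeit (the sample frame is arbitrary, and the exact ratio will vary with size and dtypes):

import timeit
import polars as pl

sample = pl.DataFrame({
    "foo": range(100_000),
    "bar": [f"row_{i}" for i in range(100_000)],
})

for fn in (hash_df, hash_df2):
    elapsed = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {elapsed / 20:.4f}s per call")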
It depends on what you need from a hash, but in my case I needed a hash to enable caching of jobs. This function worked well for that:
import mmh3
import polars as pl

def hash_dataframe(df: pl.DataFrame, seed=42) -> str:
    """Hash a polars DataFrame. Due to the behaviour of pl.DataFrame.hash_rows
    this will only be consistent given a polars version.

    Args:
        df (pl.DataFrame): polars DataFrame to be hashed.
        seed (int, optional): Seed for the hash function.

    Returns:
        str: Hash of the polars DataFrame.
    """
    row_hashes = df.hash_rows(seed=seed)
    hasher = mmh3.mmh3_x64_128(seed=seed)
    for row_hash in row_hashes:
        hasher.update(row_hash.to_bytes(64, "little"))
    return hasher.digest().hex()
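Usage might look like this (the results cache here is only for illustration):

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})

key = hash_dataframe(df)
print(key)  # 32-character hex digest of the 128-bit murmur3 hash

results_cache: dict[str, pl.DataFrame] = {}
if key not in results_cache:
    results_cache[key] = df  # stand-in for the actual job result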