Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delta table statistics

the log of a delta table stores metadata about the transactions and about statistics (data type, min, max, nr. columns etc). However, I can only see the data types when looking into the json file of this log. Does anyone know how to obtain the min, max and nr. columns of this delta table without computing anything (since the delta table should have this information when reading the file)?

like image 606
yvdw Avatar asked Oct 28 '25 08:10

yvdw


1 Answers

With the release of data skipping in Delta Lake 1.2.0, column-level statistics like min/max are now available. Statistics are saved in the Delta Lake transaction log every time an add action is performed corresponding to adding a new parquet file. Using the delta-rs Python package (v0.8.0 or newer) we can retrieve and read these add action records from the transaction log to view the file-level statistics. First, install the package with pip or conda:

# using pip:
$ pip install deltalake

# using conda:
$ conda install -c conda-forge delta-spark

We'll create a dummy Delta Table using pandas and delta-rs. Then we'll read the add action records using the get_add_actions method:

from deltalake import DeltaTable, write_deltalake
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
write_deltalake("tmp", df, partition_by=["x"])
dt = DeltaTable("tmp")
dt.get_add_actions().to_pandas()

You'll get the following output (exported as tsv for inclusion):

    path    size_bytes  modification_time   data_change partition_values    num_records null_count  min max
0   x=1/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet    1867    2023-03-28 23:28:25.634 True    {'x': 1}    1   {'y': 0}    {'y': 4}    {'y': 4}
1   x=2/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet    1867    2023-03-28 23:28:25.634 True    {'x': 2}    1   {'y': 0}    {'y': 5}    {'y': 5}
2   x=3/0-84bac732-65a4-4217-8782-14933373ad4b-0.parquet    1867    2023-03-28 23:28:25.634 True    {'x': 3}    1   {'y': 0}    {'y': 6}    {'y': 6}

note: each statistic will have its own column in the DataFrame, e.g. num_records, null_count, min, and max.

like image 107
Jim Hibbard Avatar answered Oct 29 '25 22:10

Jim Hibbard