There are occasions when I know ahead of time the full schema of a table I'm working with. In those scenarios, it would be nice to be able to specify the full schema (call it a FullyDefinedFrame), so that the type system could help me out while I write code against it. I understand that polars does this at run time once it has the full schema of the data it's working on. But what if you could get all that information while still developing?
At the moment, I imagine you could get a crummy version of this experience from a tool that creates a dummy LazyFrame/DataFrame with the schema of the FullyDefinedFrame, calls your functions on it, and gives you the results.
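For concreteness, here is a minimal sketch of what I have in mind. The helper name resolve_output_schema is my own invention, and it assumes that resolving a lazy plan's schema fails in the same way collecting it would:

# sketch.py -- hypothetical helper, not a real polars API
from typing import Callable

import polars as pl


def resolve_output_schema(
    fn: Callable[[pl.LazyFrame], pl.LazyFrame],
    input_schema: pl.Schema,
) -> pl.Schema:
    # Build an empty LazyFrame that carries only the schema.
    lf = pl.LazyFrame(schema=input_schema)
    # Resolving the plan's schema should raise if any operation is
    # invalid for these dtypes, so this doubles as a type check.
    return fn(lf).collect_schema()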
Is this possible in general? And if so, what would it take to make it work?
The closest I have come so far is to write my functions with LazyFrames, and then write a test that calls the function on a LazyFrame with the correct schema but no data. This is all based on the documentation on type checking in the lazy API.
# example.py
from datetime import date

import polars as pl


def my_fn(lf: pl.LazyFrame) -> pl.LazyFrame:
    """
    Expects the schema
    Schema({'date': pl.Date, 'employee_id': pl.Int32, 'value': pl.Float64})

    Performs a filter, then a group_by, and sums the values.
    """
    return (
        lf.filter(pl.col("date") == date(2025, 1, 1))
        .group_by(["date", "employee_id"])
        .agg(pl.col("value").sum().alias("total_value"))
    )
and a test file
# test_my_fn.py
from example import my_fn

import polars as pl


def test_my_types_and_schema():
    # The schema for the input LazyFrame
    input_schema = pl.Schema(
        [
            ("date", pl.Date),
            ("employee_id", pl.Int32),
            ("value", pl.Float64),
        ]
    )

    # Create a LazyFrame with no data, but with the correct schema
    lf = pl.LazyFrame(schema=input_schema)

    # Call the function
    out = my_fn(lf)

    # This will raise an error if the type checking fails
    out.collect()

    # If we also know what the output schema should be, define it here and
    # compare it with the output schema of the function.
    expected_schema = pl.Schema(
        [
            ("date", pl.Date),
            ("employee_id", pl.Int32),
            ("total_value", pl.Float64),
        ]
    )
    assert expected_schema == out.collect_schema()
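Incidentally, you don't strictly need a test runner for a quick look; the same check works in a REPL or a throwaway script (quick_check.py is just an illustrative name). For the function above it should print a schema with date, employee_id and total_value columns:

# quick_check.py -- ad-hoc version of the same check
import polars as pl

from example import my_fn

lf = pl.LazyFrame(
    schema={"date": pl.Date, "employee_id": pl.Int32, "value": pl.Float64}
)
# Resolves the schema of the plan without executing it on any data.
print(my_fn(lf).collect_schema())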
Then run the test, e.g. with pytest test_my_fn.py, to make sure all the operations are type safe and that the output schema matches the expected one.
If, for example, we change the date and value columns in the input schema in the test to be strings ...
input_schema = pl.Schema(
    [
        ("date", pl.String),
        ("employee_id", pl.Int32),
        ("value", pl.String),
    ]
)
and re-run the test, we get the error:
E polars.exceptions.InvalidOperationError: cannot compare 'date/datetime/time' to a string value (create native python { 'date', 'datetime', 'time' } or compare to a temporal column)
E
E Resolved plan until failure:
E
E ---> FAILED HERE RESOLVING 'group_by' <---
E FILTER [(col("date")) == (2025-01-01)] FROM
E DF ["date", "employee_id", "value"]; PROJECT */3 COLUMNS
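If you want to lock this behaviour into the test suite, you could also add a negative test asserting that a known-bad schema is rejected. Here is a sketch, assuming polars keeps raising InvalidOperationError for this plan:

# test_bad_schema.py -- negative test (sketch)
import polars as pl
import pytest

from example import my_fn


def test_rejects_string_date_column():
    bad_schema = pl.Schema(
        [
            ("date", pl.String),
            ("employee_id", pl.Int32),
            ("value", pl.String),
        ]
    )
    lf = pl.LazyFrame(schema=bad_schema)
    # Collecting the empty frame forces plan resolution, which should
    # fail because a Date comparison is applied to a String column.
    with pytest.raises(pl.exceptions.InvalidOperationError):
        my_fn(lf).collect()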
This works pretty well, but it is definitely more work than I had first hoped. A nice side effect is that the test documents and enforces the expected output schema of the function.