What is the difference between using square brackets [ ] and using expression API methods like select, filter, etc. when accessing data from a polars DataFrame? Which one should be used when?
Consider the following polars DataFrame:
import polars as pl
df = pl.DataFrame(
    {
        "a": ["a", "b", "a", "b", "b", "c"],
        "b": [2, 1, 1, 3, 2, 1],
    }
)
Usually, it is advised to use the polars expression API, as most of its methods are also part of polars' lazy API. The lazy API defers the evaluation of operations such as selection, filtering, and mutation until the result is actually needed. This allows for the query optimizations that make polars as efficient as it is.
In contrast, accessing columns using square bracket notation only works in eager mode.
As a concrete example, consider reading the first element of a rather large .csv file (~700 MB). We start by creating the .csv file and writing it to disk.
import polars as pl
pl.DataFrame({"col": [0] * 100_000_000}).write_csv("df.csv")
Using square bracket notation ([ ]).
As square bracket notation does not work on LazyFrames, we'll need to use pl.read_csv to read the file into a DataFrame object.
pl.read_csv("df.csv")["col"][0]
On my machine, this takes roughly 200ms.
Using polars' lazy API.
The same can be achieved using polars' lazy API (pl.scan_csv, select, first, item).
pl.scan_csv("df.csv").select("col").first().collect().item()
Deferring the evaluation allows polars to avoid reading in the entire file, so the execution only takes about 300μs on my machine, i.e. we obtained a ~650x speed-up by using polars' lazy API.
Of course, one could argue that much of the time here is saved by not reading the entire .csv file rather than by using select over [ ]. However, note that this saving is itself a benefit of the lazy API.