I have a dataset with the elapsed time of certain events and I would like to know the total time spent in each time bracket using polars.
The hist function gives the count of events within each bracket like this:
>>> df = pl.DataFrame(
... {
... "nam": ["a", "a", "a", "b", "b", "b", "b"],
... "ela": [1, 1, 20, 3, 4, 10, 20],
... }
... )
>>> df.group_by("nam").agg(
... pl.col("ela").hist([ 1, 10,])
... )
shape: (2, 2)
┌─────┬───────────┐
│ nam ┆ ela │
│ --- ┆ --- │
│ str ┆ list[u32] │
╞═════╪═══════════╡
│ a ┆ [2, 0, 1] │
│ b ┆ [0, 3, 1] │
└─────┴───────────┘
What I would like to see is something like this
shape: (2, 2)
┌─────┬─────────────┐
│ nam ┆ ela │
│ --- ┆ --- │
│ str ┆ list[u32] │
╞═════╪═════════════╡
│ a ┆ [2, 0, 20] │
│ b ┆ [0, 17, 20] │
└─────┴─────────────┘
I've tried to use cut and implode
>>> (df.with_columns(pl.col("ela").cut([ 1, 10,]).alias("bucket"))
... .group_by("nam", "bucket")
... .agg(ela=pl.sum("ela"))
... .group_by("nam")
... .agg(pl.sum("ela").alias("ela_sum"), pl.implode("bucket", "ela"))
... )
shape: (2, 4)
┌─────┬─────────┬──────────────────────────────┬─────────────────┐
│ nam ┆ ela_sum ┆ bucket ┆ ela │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ list[list[cat]] ┆ list[list[i64]] │
╞═════╪═════════╪══════════════════════════════╪═════════════════╡
│ b ┆ 37 ┆ [["(1, 10]", "(10, inf]"]] ┆ [[17, 20]] │
│ a ┆ 22 ┆ [["(-inf, 1]", "(10, inf]"]] ┆ [[2, 20]] │
└─────┴─────────┴──────────────────────────────┴─────────────────┘
but I don't know how to get the compact inclusive format that hist. It is missing the zero buckets and for some reason it shows up as a list within list.
Maybe not really elegant, but you could do something like:
df_res = df.with_columns(bucket = pl.col.ela.cut([ 1, 10,]).to_physical())
(
pl.DataFrame(pl.Series("bucket", range(3), dtype=pl.UInt32))
.join(df.select(pl.col.nam.unique()), how="cross")
.join(df_res, on=["bucket","nam"], how="left")
.group_by("nam","bucket").sum()
.sort("nam","bucket")
.group_by("nam", maintain_order=True).agg(pl.col.ela)
)
shape: (2, 2)
┌─────┬─────────────┐
│ nam ┆ ela │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═════╪═════════════╡
│ a ┆ [2, 0, 20] │
│ b ┆ [0, 17, 20] │
└─────┴─────────────┘
Here's one ugly way to do it, I sure hope there are better ways:
import polars as pl
import polars.selectors as cs
df = pl.DataFrame(
{
"nam": ["a", "a", "a", "b", "b", "b", "b"],
"ela": [1, 1, 20, 3, 4, 10, 20],
}
)
(
df.with_columns(pl.col("ela").cut([1, 10,]).alias("bucket"))
.group_by("nam", "bucket")
.agg(
b=pl.col("ela").hist([1, 10,]),
c=pl.col("ela").sum()
)
.select(
(
pl.col("b").list.eval(
pl.element() > 0
) * pl.col("c")
).list.to_struct(),
pl.col("nam")
)
.unnest("b")
.group_by("nam")
.agg(pl.concat_list(cs.numeric().sum()).alias("bucket"))
)
This gives you the result of:
┌─────┬─────────────┐
│ nam ┆ bucket │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═════╪═════════════╡
│ a ┆ [2, 0, 20] │
│ b ┆ [0, 17, 20] │
└─────┴─────────────┘
Breaking it down a bit, combining your two attempts to one that we can use:
df.with_columns(pl.col("ela").cut([1, 10,]).alias("bucket"))
.group_by("nam", "bucket")
.agg(
b=pl.col("ela").hist([1, 10,]),
c=pl.col("ela").sum()
)
┌─────┬───────────┬───────────┬─────┐
│ nam ┆ bucket ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ cat ┆ list[u32] ┆ i64 │
╞═════╪═══════════╪═══════════╪═════╡
│ b ┆ (10, inf] ┆ [0, 0, 1] ┆ 20 │
│ a ┆ (-inf, 1] ┆ [2, 0, 0] ┆ 2 │
│ a ┆ (10, inf] ┆ [0, 0, 1] ┆ 20 │
│ b ┆ (1, 10] ┆ [0, 3, 0] ┆ 17 │
└─────┴───────────┴───────────┴─────┘
And then the rest is an ugly way to transform that to he final result by
concat_list and numeric selector.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With