Can I get a histogram where each value is the sum instead of count of values in bucket?

Question

I have a dataset with the elapsed time of certain events and I would like to know the total time spent in each time bracket using polars.

The hist function gives the count of events within each bracket like this:

>>> df = pl.DataFrame(
...     {
...         "nam": ["a", "a", "a", "b", "b", "b", "b"],
...         "ela": [1, 1, 20, 3, 4, 10, 20],
...     }
... )
>>> df.group_by("nam").agg(
...     pl.col("ela").hist([ 1, 10,])
... )
shape: (2, 2)
┌─────┬───────────┐
│ nam ┆ ela       │
│ --- ┆ ---       │
│ str ┆ list[u32] │
╞═════╪═══════════╡
│ a   ┆ [2, 0, 1] │
│ b   ┆ [0, 3, 1] │
└─────┴───────────┘

What I would like to see is something like this

shape: (2, 2)
┌─────┬─────────────┐
│ nam ┆ ela         │
│ --- ┆ ---         │
│ str ┆ list[u32]   │
╞═════╪═════════════╡
│ a   ┆ [2, 0, 20]  │
│ b   ┆ [0, 17, 20] │
└─────┴─────────────┘

I've tried to use cut and implode

>>> (df.with_columns(pl.col("ela").cut([ 1, 10,]).alias("bucket"))
...     .group_by("nam", "bucket")
...     .agg(ela=pl.sum("ela"))
...     .group_by("nam")
...     .agg(pl.sum("ela").alias("ela_sum"), pl.implode("bucket", "ela"))
... )
shape: (2, 4)
┌─────┬─────────┬──────────────────────────────┬─────────────────┐
│ nam ┆ ela_sum ┆ bucket                       ┆ ela             │
│ --- ┆ ---     ┆ ---                          ┆ ---             │
│ str ┆ i64     ┆ list[list[cat]]              ┆ list[list[i64]] │
╞═════╪═════════╪══════════════════════════════╪═════════════════╡
│ b   ┆ 37      ┆ [["(1, 10]", "(10, inf]"]]   ┆ [[17, 20]]      │
│ a   ┆ 22      ┆ [["(-inf, 1]", "(10, inf]"]] ┆ [[2, 20]]       │
└─────┴─────────┴──────────────────────────────┴─────────────────┘

but I don't know how to get the compact inclusive format that hist. It is missing the zero buckets and for some reason it shows up as a list within list.

Roman Pekar · Accepted Answer

Maybe not really elegant, but you could do something like:

df_res = df.with_columns(bucket = pl.col.ela.cut([ 1, 10,]).to_physical())

(
    pl.DataFrame(pl.Series("bucket", range(3), dtype=pl.UInt32))
    .join(df.select(pl.col.nam.unique()), how="cross")
    .join(df_res, on=["bucket","nam"], how="left")
    .group_by("nam","bucket").sum()
    .sort("nam","bucket")
    .group_by("nam", maintain_order=True).agg(pl.col.ela)
)

shape: (2, 2)
┌─────┬─────────────┐
│ nam ┆ ela         │
│ --- ┆ ---         │
│ str ┆ list[i64]   │
╞═════╪═════════════╡
│ a   ┆ [2, 0, 20]  │
│ b   ┆ [0, 17, 20] │
└─────┴─────────────┘

Teemu Risikko · Answer

Solution

Here's one ugly way to do it, I sure hope there are better ways:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame(
    {
        "nam": ["a", "a", "a", "b", "b", "b", "b"],
        "ela": [1, 1, 20, 3, 4, 10, 20],
    }
)


(
    df.with_columns(pl.col("ela").cut([1, 10,]).alias("bucket"))
    .group_by("nam", "bucket")
    .agg(
        b=pl.col("ela").hist([1, 10,]),
        c=pl.col("ela").sum()
    )
    .select(
        (
            pl.col("b").list.eval(
                pl.element() > 0
            ) * pl.col("c")
        ).list.to_struct(),
        pl.col("nam")
    )
    .unnest("b")
    .group_by("nam")
    .agg(pl.concat_list(cs.numeric().sum()).alias("bucket"))
)

This gives you the result of:

┌─────┬─────────────┐
│ nam ┆ bucket      │
│ --- ┆ ---         │
│ str ┆ list[i64]   │
╞═════╪═════════════╡
│ a   ┆ [2, 0, 20]  │
│ b   ┆ [0, 17, 20] │
└─────┴─────────────┘

Explanation

Breaking it down a bit, combining your two attempts to one that we can use:

df.with_columns(pl.col("ela").cut([1, 10,]).alias("bucket"))
    .group_by("nam", "bucket")
    .agg(
        b=pl.col("ela").hist([1, 10,]),
        c=pl.col("ela").sum()
)
┌─────┬───────────┬───────────┬─────┐
│ nam ┆ bucket    ┆ b         ┆ c   │
│ --- ┆ ---       ┆ ---       ┆ --- │
│ str ┆ cat       ┆ list[u32] ┆ i64 │
╞═════╪═══════════╪═══════════╪═════╡
│ b   ┆ (10, inf] ┆ [0, 0, 1] ┆ 20  │
│ a   ┆ (-inf, 1] ┆ [2, 0, 0] ┆ 2   │
│ a   ┆ (10, inf] ┆ [0, 0, 1] ┆ 20  │
│ b   ┆ (1, 10]   ┆ [0, 3, 0] ┆ 17  │
└─────┴───────────┴───────────┴─────┘

And then the rest is an ugly way to transform that to he final result by

Counting the non-zero occurrences in lists
Converting to struct and further unnesting to columns (to get element-wise sum of the lists)
Calculating the groupwise sums and aggregating the lists back with concat_list and numeric selector.

Can I get a histogram where each value is the sum instead of count of values in bucket?

Tags:

python

python-polars

blitzkopf

2 Answers

Roman Pekar

Solution

Explanation

Teemu Risikko

Recent Activity

Donate For Us

Can I get a histogram where each value is the sum instead of count of values in bucket?

Tags:

python

python-polars

blitzkopf

2 Answers

Roman Pekar

Solution

Explanation

Teemu Risikko

Related questions

Recent Activity

Donate For Us