Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Attaching an "in-group index" to each row of sorted data with Polars [duplicate]

Here's the solution I came up with for the problem:

import polars as pl
import numpy as np

max_groups = 5
max_reps = 3

# print out all rows in our table, for the sake of convenience
pl.Config.set_tbl_rows(max_groups * max_reps)

num_groups = np.random.randint(3, max_groups + 1)
unique_ids = np.random.randint(97, 123, num_groups)
repetitions = np.random.randint(1, max_reps + 1, num_groups)

id_col = "id"
data_col = "point"
index_col = "ixs"

# # Generate data
# convert integers to ascii using `chr`
ids = pl.Series(
    id_col,
    [c for n, id in zip(repetitions, unique_ids) for c in [chr(id)] * n],
)
data = pl.Series(
    data_col,
    np.random.rand(len(ids)),
)
df = pl.DataFrame([ids, data])

# # Generate indices
df.sort(id_col, data_col).group_by(id_col).agg(
    pl.col(data_col), pl.int_range(pl.len()).alias(index_col)
).explode(data_col, index_col).sort(id_col, data_col)
shape: (7, 3)
┌─────┬──────────┬─────┐
│ id  ┆ point    ┆ ixs │
│ --- ┆ ---      ┆ --- │
│ str ┆ f64      ┆ i64 │
╞═════╪══════════╪═════╡
│ g   ┆ 0.030686 ┆ 0   │
│ g   ┆ 0.322024 ┆ 1   │
│ k   ┆ 0.124792 ┆ 0   │
│ k   ┆ 0.289025 ┆ 1   │
│ s   ┆ 0.485742 ┆ 0   │
│ s   ┆ 0.689397 ┆ 1   │
│ u   ┆ 0.516705 ┆ 0   │
└─────┴──────────┴─────┘

Can I do better? I sort twice, for instance: once before grouping, and once after. I can eliminate the need for the second sort by maintain_order=True in the group_by:

# # Generate indices, but maintain_order in group_by
df.sort(id_col, data_col).group_by(id_col, maintain_order=True).agg(
    pl.col(data_col), pl.int_range(pl.len()).alias(index_col)
).explode(data_col, index_col)

(Some simple, very naive, timeit based experments suggest maintain_order=True generally wins over sorting twice, but not by a large margin.)

like image 783
bzm3r Avatar asked Dec 07 '25 08:12

bzm3r


2 Answers

You were already on the right track using pl.int_range. However, you don't need a group_by / agg construct here. Instead, a window function can be used with pl.Expr.over. This allows you to evaluate pl.int_range separately in each group.

It might look as follows.

(
    df
    .sort(id_col, data_col)
    .with_columns(
        pl.int_range(pl.len()).over(id_col)
    )
)
shape: (8, 3)
┌─────┬──────────┬─────────┐
│ id  ┆ point    ┆ literal │
│ --- ┆ ---      ┆ ---     │
│ str ┆ f64      ┆ i64     │
╞═════╪══════════╪═════════╡
│ m   ┆ 0.291593 ┆ 0       │
│ m   ┆ 0.60665  ┆ 1       │
│ q   ┆ 0.480906 ┆ 0       │
│ q   ┆ 0.545202 ┆ 1       │
│ q   ┆ 0.706958 ┆ 2       │
│ t   ┆ 0.156814 ┆ 0       │
│ y   ┆ 0.460135 ┆ 0       │
│ y   ┆ 0.631585 ┆ 1       │
└─────┴──────────┴─────────┘
like image 111
Hericks Avatar answered Dec 12 '25 17:12

Hericks


You can use .rank on pl.col(data_col) with .over(id_col) instead.

rank is 1-based so you'll have to subtract 1.

df = df.sort(id_col, data_col).with_columns(
    ixs=(pl.col(data_col).rank("ordinal").over(id_col) - 1)
)

Output with np.random.seed(0) (matches the dataframe your code generates):

shape: (7, 3)
┌─────┬──────────┬─────┐
│ id  ┆ point    ┆ ixs │
│ --- ┆ ---      ┆ --- │
│ str ┆ f64      ┆ u32 │
╞═════╪══════════╪═════╡
│ a   ┆ 0.528895 ┆ 0   │
│ a   ┆ 0.568045 ┆ 1   │
│ a   ┆ 0.791725 ┆ 2   │
│ p   ┆ 0.437587 ┆ 0   │
│ p   ┆ 0.891773 ┆ 1   │
│ v   ┆ 0.383442 ┆ 0   │
│ v   ┆ 0.963663 ┆ 1   │
└─────┴──────────┴─────┘

Note: rank has many modes you can choose from to handle ties. You can read about it in the link above.

like image 21
Dogbert Avatar answered Dec 12 '25 15:12

Dogbert



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!