Here's the solution I came up with for the problem:
import polars as pl
import numpy as np
max_groups = 5
max_reps = 3
# print out all rows in our table, for the sake of convenience
pl.Config.set_tbl_rows(max_groups * max_reps)
num_groups = np.random.randint(3, max_groups + 1)
unique_ids = np.random.randint(97, 123, num_groups)
repetitions = np.random.randint(1, max_reps + 1, num_groups)
id_col = "id"
data_col = "point"
index_col = "ixs"
# # Generate data
# convert integers to ASCII characters using `chr`
ids = pl.Series(
    id_col,
    [c for n, id in zip(repetitions, unique_ids) for c in [chr(id)] * n],
)
data = pl.Series(
    data_col,
    np.random.rand(len(ids)),
)
df = pl.DataFrame([ids, data])
# # Generate indices
df.sort(id_col, data_col).group_by(id_col).agg(
    pl.col(data_col), pl.int_range(pl.len()).alias(index_col)
).explode(data_col, index_col).sort(id_col, data_col)
shape: (7, 3)
┌─────┬──────────┬─────┐
│ id  ┆ point    ┆ ixs │
│ --- ┆ ---      ┆ --- │
│ str ┆ f64      ┆ i64 │
╞═════╪══════════╪═════╡
│ g   ┆ 0.030686 ┆ 0   │
│ g   ┆ 0.322024 ┆ 1   │
│ k   ┆ 0.124792 ┆ 0   │
│ k   ┆ 0.289025 ┆ 1   │
│ s   ┆ 0.485742 ┆ 0   │
│ s   ┆ 0.689397 ┆ 1   │
│ u   ┆ 0.516705 ┆ 0   │
└─────┴──────────┴─────┘
Can I do better? For instance, I sort twice: once before grouping and once after. I can eliminate the second sort by passing maintain_order=True to group_by:
# # Generate indices, but maintain_order in group_by
df.sort(id_col, data_col).group_by(id_col, maintain_order=True).agg(
    pl.col(data_col), pl.int_range(pl.len()).alias(index_col)
).explode(data_col, index_col)
(Some simple, very naive, timeit-based experiments suggest maintain_order=True generally wins over sorting twice, but not by a large margin.)
You were already on the right track using pl.int_range. However, you don't need a group_by / agg construct here. Instead, you can use a window function via pl.Expr.over, which evaluates pl.int_range separately within each group.
It might look as follows.
(
    df
    .sort(id_col, data_col)
    .with_columns(
        pl.int_range(pl.len()).over(id_col).alias(index_col)
    )
)
shape: (8, 3)
┌─────┬──────────┬─────┐
│ id  ┆ point    ┆ ixs │
│ --- ┆ ---      ┆ --- │
│ str ┆ f64      ┆ i64 │
╞═════╪══════════╪═════╡
│ m   ┆ 0.291593 ┆ 0   │
│ m   ┆ 0.60665  ┆ 1   │
│ q   ┆ 0.480906 ┆ 0   │
│ q   ┆ 0.545202 ┆ 1   │
│ q   ┆ 0.706958 ┆ 2   │
│ t   ┆ 0.156814 ┆ 0   │
│ y   ┆ 0.460135 ┆ 0   │
│ y   ┆ 0.631585 ┆ 1   │
└─────┴──────────┴─────┘
You can use .rank on pl.col(data_col) with .over(id_col) instead. rank is 1-based, so you'll have to subtract 1.
df = df.sort(id_col, data_col).with_columns(
    ixs=(pl.col(data_col).rank("ordinal").over(id_col) - 1)
)
Output with np.random.seed(0) (matches the dataframe your code generates):
shape: (7, 3)
┌─────┬──────────┬─────┐
│ id  ┆ point    ┆ ixs │
│ --- ┆ ---      ┆ --- │
│ str ┆ f64      ┆ u32 │
╞═════╪══════════╪═════╡
│ a   ┆ 0.528895 ┆ 0   │
│ a   ┆ 0.568045 ┆ 1   │
│ a   ┆ 0.791725 ┆ 2   │
│ p   ┆ 0.437587 ┆ 0   │
│ p   ┆ 0.891773 ┆ 1   │
│ v   ┆ 0.383442 ┆ 0   │
│ v   ┆ 0.963663 ┆ 1   │
└─────┴──────────┴─────┘
Note: rank has several methods you can choose from to handle ties; see the documentation for pl.Expr.rank for details.