Polars python: interpolation in the groupby context

In Polars (Python) I have a DataFrame with three columns: x (integers that are multiples of 5, with some values missing from the sequence), y (integers), and z (strings, a category). I want to group by column z and interpolate columns x and y. Here is an example DataFrame:

┌─────┬─────┬─────┐
│ x   ┆ y   ┆ z   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 5   ┆ 1   ┆ A   │
│ 10  ┆ 2   ┆ A   │
│ 20  ┆ 4   ┆ A   │
│ 25  ┆ 5   ┆ A   │
│ 10  ┆ 2   ┆ B   │
│ 20  ┆ 4   ┆ B   │
│ 30  ┆ 6   ┆ B   │
└─────┴─────┴─────┘

And here is the desired output:

┌─────┬─────┬─────┐
│ x   ┆ y   ┆ z   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 5   ┆ 1   ┆ A   │
│ 10  ┆ 2   ┆ A   │
│ 15  ┆ 3   ┆ A   │
│ 20  ┆ 4   ┆ A   │
│ 25  ┆ 5   ┆ A   │
│ 10  ┆ 2   ┆ B   │
│ 15  ┆ 3   ┆ B   │
│ 20  ┆ 4   ┆ B   │
│ 25  ┆ 5   ┆ B   │
│ 30  ┆ 6   ┆ B   │
└─────┴─────┴─────┘

The step between consecutive x values (within each category) should always be 5. My real DataFrame is very large, so I would like to work with pl.LazyFrame instead of pl.DataFrame.

Without the category column z I solved the issue with a join:

import polars as pl

#  Main dataframe
data = dict(x=[10, 20, 30], y=[2, 4, 6])
df = pl.DataFrame(data)

# Dataframe with all x values
step = 5 
df_1 = pl.DataFrame(dict(x=range(df["x"].min(), df["x"].max() + step, step)))

#  merging and interpolation 
print((
    df_1
    .join(df, on="x", how="left")
    .with_columns(pl.col("y").interpolate())
))

and the result was:

┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 2   │
│ 15  ┆ 3   │
│ 20  ┆ 4   │
│ 25  ┆ 5   │
│ 30  ┆ 6   │
└─────┴─────┘

This works as expected, but I cannot figure out how to apply it in the group_by context.

valu asked Dec 29 '25 21:12

1 Answer

You could extend your join-based example (pl.DataFrame.join) by joining on both x and z, as follows.

First, we create an upsampled DataFrame (for all groups defined by z) to join on.

upsampled = (
    df
    .group_by("z")
    .agg(
        pl.int_range(pl.col("x").min(), pl.col("x").max()+5, step=5).alias("x")
    )
    .explode("x")
)

Next, we perform a left-join on the upsampled DataFrame and interpolate column y.

(
    upsampled
    .join(
        df,
        on=["x", "z"],
        how="left"
    )
    .with_columns(
        pl.col("y").interpolate()
    )
)

Output (row order may differ unless maintain_order=True is set in the group_by):

shape: (10, 3)
┌─────┬─────┬─────┐
│ z   ┆ x   ┆ y   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ A   ┆ 5   ┆ 1.0 │
│ A   ┆ 10  ┆ 2.0 │
│ A   ┆ 15  ┆ 3.0 │
│ A   ┆ 20  ┆ 4.0 │
│ A   ┆ 25  ┆ 5.0 │
│ B   ┆ 10  ┆ 2.0 │
│ B   ┆ 15  ┆ 3.0 │
│ B   ┆ 20  ┆ 4.0 │
│ B   ┆ 25  ┆ 5.0 │
│ B   ┆ 30  ┆ 6.0 │
└─────┴─────┴─────┘
Hericks answered Dec 31 '25 09:12