Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to put multiple steps into a polars expression function?

I've been learning how to write expression functions in polars which are wonderful for creating self-documenting chained operations. But I'm struggling with more complex functions. Let's say I want to replace a value in column bar with the first value in column baz when baz is empty, over the group column foo. To state more clearly: I have a set of columns which form a a sorted group (in my example only foo). I have another column bar which may or may not have empty values. If the first value in bar for the group is empty ('' or NULL) take the corresponding value from a another column baz and apply to every bar in the group. If the first value in bar is not empty then do nothing to the group.

The below works correctly.

Initial DataFrame:

import polars as pl

df = pl.DataFrame({'foo': [1, 1, 1, 2, 2, 2, 3, 3],
                   'bar': ['a', 'a', 'a', None, None, None, 'c', 'c'],
                   'baz': ['x', None, 'q', 'z', 'r', None, 'y', 's']})
shape: (8, 3)
┌─────┬──────┬──────┐
│ foo ┆ bar  ┆ baz  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ x    │
│ 1   ┆ a    ┆ null │
│ 1   ┆ a    ┆ q    │
│ 2   ┆ null ┆ z    │
│ 2   ┆ null ┆ r    │
│ 2   ┆ null ┆ null │
│ 3   ┆ c    ┆ y    │
│ 3   ┆ c    ┆ s    │
└─────┴──────┴──────┘

Transformation I want to perform:

df = (df.with_columns(pl.col('baz').first().over(['foo']).alias('temp'))
        .with_columns(pl.when((pl.col('bar') == '') | (pl.col('bar').is_null()))
                      .then(pl.col('temp'))
                      .otherwise(pl.col('bar')).alias('bar2'))
        .with_columns(pl.col('bar2').alias('bar'))
        .drop(['temp', 'bar2'])
      )

Expected Result:

┌─────┬──────┬──────┐
│ foo ┆ bar  ┆ baz  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ x    │
│ 1   ┆ a    ┆ null │
│ 1   ┆ a    ┆ q    │
│ 2   ┆ z    ┆ z    │
│ 2   ┆ z    ┆ r    │
│ 2   ┆ z    ┆ null │
│ 3   ┆ c    ┆ y    │
│ 3   ┆ c    ┆ s    │
└─────┴──────┴──────┘

In my actual problem, this chain would be only a subset of a larger chain, so it would be great if I could write

def update_bar() -> pl.expr:
   return (#some voodoo)

and then:

df = (df.with_columns(update_bar())
        .drop(['temp', 'bar2'])
     )

or even

df = (df.with_columns(update_bar())
        .with_columns(pl.col('bar2').alias('bar'))
        .drop(['temp', 'bar2'])
     )

The first two operations at the top go together so I'd really like to avoid writing two functions. Any guidance on how to go about this?

Or maybe someone even has a more clever way to accomplish what I need all this code to do? Note that foo and bar having matching grouping is only true in this simplified example. In my real case foo is 3 columns and bar cannot be used as the group alone.

like image 218
MikeP Avatar asked Oct 15 '25 00:10

MikeP


2 Answers

It seems like the task could be achieved in a single expression?

df.with_columns(
   pl.when((pl.col("bar").first() == "") | pl.col("bar").first().is_null())
     .then(pl.col("baz").first())
     .otherwise(pl.col("bar"))
     .over("foo")
     .alias("new_bar")
)
shape: (8, 4)
┌─────┬──────┬──────┬─────────┐
│ foo ┆ bar  ┆ baz  ┆ new_bar │
│ --- ┆ ---  ┆ ---  ┆ ---     │
│ i64 ┆ str  ┆ str  ┆ str     │
╞═════╪══════╪══════╪═════════╡
│ 1   ┆ a    ┆ x    ┆ a       │
│ 1   ┆ a    ┆ null ┆ a       │
│ 1   ┆ a    ┆ q    ┆ a       │
│ 2   ┆ null ┆ z    ┆ z       │
│ 2   ┆ null ┆ r    ┆ z       │
│ 2   ┆ null ┆ null ┆ z       │
│ 3   ┆ c    ┆ y    ┆ c       │
│ 3   ┆ c    ┆ s    ┆ c       │
└─────┴──────┴──────┴─────────┘

As an expression inside a function:

def update_bar() -> pl.Expr:
    bar_is_empty = pl.any_horizontal( # "easier" way to build | chains
        pl.col("bar").first() == "",
        pl.col("bar").first().is_null()
    )
   
    return (
        pl.when(bar_is_empty)
          .then(pl.col("baz").first())
          .otherwise(pl.col("bar"))
          .over("foo")
          .alias("bar")
    )
>>> df.with_columns(update_bar())
shape: (8, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ baz  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ a   ┆ x    │
│ 1   ┆ a   ┆ null │
│ 1   ┆ a   ┆ q    │
│ 2   ┆ z   ┆ z    │
│ 2   ┆ z   ┆ r    │
│ 2   ┆ z   ┆ null │
│ 3   ┆ c   ┆ y    │
│ 3   ┆ c   ┆ s    │
└─────┴─────┴──────┘

There are also Frame/Expression pipe methods which may be relevant:

  • DataFrame.pipe()

  • LazyFrame.pipe()

  • Expr.pipe()

like image 143
jqurious Avatar answered Oct 17 '25 14:10

jqurious


An alternate approach to solve this would be to use pl.coalesce(). It also reads beautifully IMO:

Coalesce the first value of baz over foo with the value of bar

df.with_columns(
  new_bar = pl.coalesce(
      pl.col("bar"), 
      pl.col('baz').first().over("foo")
  )
)

UPDATE: Thanks to @jqurious for pointing out what the OP wanted.

like image 42
Ramnath Avatar answered Oct 17 '25 12:10

Ramnath



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!