
Python Polars: Number of Rows until the next value in a group

Given a polars DataFrame:
import polars as pl

data = pl.DataFrame({"user_id": [1, 1, 1, 2, 2, 2], "login": [False, True, False, False, False, True]})

How can I add a column holding the number of rows until the user next logs in, with any rows after that user's last login set to None? The expected output for the data above is:
[1, 0, None, 2, 1, 0]

I have tried adapting the answer from here with a backward_fill() but cannot get it working.

cdkdrf asked Sep 06 '25 19:09


2 Answers

IIUC, you have to use backward_fill and invert the subtraction:

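(data
   .with_row_index()  # adds a 0-based "index" column
   .with_columns(distance =
      # index of the next login for the same user, minus the current row's index
      pl.when("login").then("index").backward_fill().over("user_id") - pl.col.index
   )
)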

Output:

┌───────┬─────────┬───────┬──────────┐
│ index ┆ user_id ┆ login ┆ distance │
│ ---   ┆ ---     ┆ ---   ┆ ---      │
│ u32   ┆ i64     ┆ bool  ┆ u32      │
╞═══════╪═════════╪═══════╪══════════╡
│ 0     ┆ 1       ┆ false ┆ 1        │
│ 1     ┆ 1       ┆ true  ┆ 0        │
│ 2     ┆ 1       ┆ false ┆ null     │
│ 3     ┆ 2       ┆ false ┆ 2        │
│ 4     ┆ 2       ┆ false ┆ 1        │
│ 5     ┆ 2       ┆ true  ┆ 0        │
└───────┴─────────┴───────┴──────────┘
mozway answered Sep 11 '25 02:09


You can create a reverse index using the step parameter of pl.int_range.

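# Counts down from the group length to 1 once evaluated inside over()
i_expr = pl.int_range(pl.len(), 0, step=-1)

(
    data.with_columns(
        # current reverse index minus the reverse index of the next login
        (i_expr - pl.when("login").then(i_expr).backward_fill())
        .over('user_id')
        .alias('distance')
    )
)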

┌─────────┬───────┬──────────┐
│ user_id ┆ login ┆ distance │
│ ---     ┆ ---   ┆ ---      │
│ i64     ┆ bool  ┆ i64      │
╞═════════╪═══════╪══════════╡
│ 1       ┆ false ┆ 1        │
│ 1       ┆ true  ┆ 0        │
│ 1       ┆ false ┆ null     │
│ 2       ┆ false ┆ 2        │
│ 2       ┆ false ┆ 1        │
│ 2       ┆ true  ┆ 0        │
└─────────┴───────┴──────────┘

Or simply reverse the subtraction, as in @mozway's answer:

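# Counts up from 0 within each group once evaluated inside over()
i_expr = pl.int_range(pl.len())

(
    data.with_columns(
        # index of the next login minus the current index
        (pl.when("login").then(i_expr).backward_fill() - i_expr)
        .over('user_id')
        .alias('distance')
    )
)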

┌─────────┬───────┬──────────┐
│ user_id ┆ login ┆ distance │
│ ---     ┆ ---   ┆ ---      │
│ i64     ┆ bool  ┆ i64      │
╞═════════╪═══════╪══════════╡
│ 1       ┆ false ┆ 1        │
│ 1       ┆ true  ┆ 0        │
│ 1       ┆ false ┆ null     │
│ 2       ┆ false ┆ 2        │
│ 2       ┆ false ┆ 1        │
│ 2       ┆ true  ┆ 0        │
└─────────┴───────┴──────────┘

Note that I've also moved the index calculation into a separate i_expr variable and moved the over() call to the end of the expression, so it is only needed once. This makes the solution easier to adjust.

Roman Pekar answered Sep 11 '25 00:09