Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a weighted sum of some columns in a Polars DataFrame?

I'm looking for an idiomatic way to calculate the weighted sum of a subset of the columns in a Polars DataFrame and add it to the DataFrame as new column. So let's say I want to multiply columns p1-p3 in the DataFrame below by the following weights and then sum them to create a new column.

weights = [7.4, 3.2, -0.13]

df = pl.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "p1": [44.3, 2.3, 2.4, 6.2],
        "p2": [7.3, 8.4, 10.3, 8.443],
        "p3": [70.3, 80.4, 100.3, 80.443],
        "p4": [16.4, 18.2, 11.5, 18.34],
    }
)
df
shape: (4, 5)
┌─────┬──────┬───────┬────────┬───────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   │
╞═════╪══════╪═══════╪════════╪═══════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 │
└─────┴──────┴───────┴────────┴───────┘

I have come up with the following solution that calculates the correct answer, but I feel that there is likely a simpler, more idiomatic method that would enable me to select the columns of interest without needing to re-specify the df within the with_columns function. Any suggestions?

df.with_columns(
    df.select(
        pl.col(col) * pl.lit(weights[i])
        for i, col in enumerate(["p1", "p2", "p3"])
    )
    .fold(lambda c1, c2: c1 + c2)
    .alias("index"),
)
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘
like image 219
Tikkanz Avatar asked Oct 15 '25 21:10

Tikkanz


2 Answers

sum_horizontal() could help here.

We can do the following:

col_names = ["p1", "p2", "p3"]
weights = [7.4, 3.2, -0.13]
df.with_columns(
    pl.sum_horizontal(
        [pl.col(col_nm) * wgt
         for col_nm, wgt in zip(col_names, weights)]
    ).alias("index")
)
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘

I used zip instead of enumerate .. but that's a stylistic choice. And I allowed Polars to broadcast the weight scalar to a literal, instead of explictly using pl.lit.

Another trick that may help with readability: we can generate the list of expressions outside the with_columns/select contexts.

For example:

col_names = ["p1", "p2", "p3", "p4"]
weights = [7.4, 3.2, -0.13, 0.0]
wghtd_cols = [
    pl.col(col_nm) * wgt
    for col_nm, wgt in zip(col_names, weights)
    if wgt != 0.0
]

df.with_columns(pl.sum_horizontal(wghtd_cols).alias("index"))
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘

This is particularly useful when one part of your code is generating the weights and/or selecting columns, and another part of your code is creating the resulting weighted sum column in the DataFrame.

You were almost there. You can use the pl.fold expression directly.

df.with_columns(
    pl.fold(acc=0, function=lambda c1, c2: c1 + c2, exprs=[
        pl.col(col) * pl.lit(weights[i])
        for i, col in enumerate(["p1", "p2", "p3"])
    ]).alias("index")
)
shape: (4, 6)
┌─────┬──────┬───────┬────────┬───────┬──────────┐
│ id  ┆ p1   ┆ p2    ┆ p3     ┆ p4    ┆ index    │
│ --- ┆ ---  ┆ ---   ┆ ---    ┆ ---   ┆ ---      │
│ i64 ┆ f64  ┆ f64   ┆ f64    ┆ f64   ┆ f64      │
╞═════╪══════╪═══════╪════════╪═══════╪══════════╡
│ 1   ┆ 44.3 ┆ 7.3   ┆ 70.3   ┆ 16.4  ┆ 342.041  │
│ 2   ┆ 2.3  ┆ 8.4   ┆ 80.4   ┆ 18.2  ┆ 33.448   │
│ 3   ┆ 2.4  ┆ 10.3  ┆ 100.3  ┆ 11.5  ┆ 37.681   │
│ 4   ┆ 6.2  ┆ 8.443 ┆ 80.443 ┆ 18.34 ┆ 62.44001 │
└─────┴──────┴───────┴────────┴───────┴──────────┘
like image 43
ritchie46 Avatar answered Oct 18 '25 23:10

ritchie46



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!