I would like to fill a value in a column based on another columns' name, in the Polars library from python (I obtained the following DF by exploding my variables' column names):
Input:
df = pl.from_repr("""
┌────────┬─────────┬────────┬─────┬──────────┐
│ Name   ┆ Average ┆ Median ┆ Q1  ┆ Variable │
│ ---    ┆ ---     ┆ ---    ┆ --- ┆ ---      │
│ str    ┆ i64     ┆ i64    ┆ i64 ┆ str      │
╞════════╪═════════╪════════╪═════╪══════════╡
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Average  │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Median   │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Q1       │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Average  │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Median   │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Q1       │
└────────┴─────────┴────────┴─────┴──────────┘
""")
Expected output:
shape: (6, 6)
┌────────┬─────────┬────────┬─────┬──────────┬───────┐
│ Name   ┆ Average ┆ Median ┆ Q1  ┆ Variable ┆ value │
│ ---    ┆ ---     ┆ ---    ┆ --- ┆ ---      ┆ ---   │
│ str    ┆ i64     ┆ i64    ┆ i64 ┆ str      ┆ i64   │
╞════════╪═════════╪════════╪═════╪══════════╪═══════╡
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Average  ┆ 2     │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Median   ┆ 3     │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Q1       ┆ 4     │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Average  ┆ 1     │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Median   ┆ 5     │
│ Banana ┆ 1       ┆ 5      ┆ 10  ┆ Q1       ┆ 10    │
└────────┴─────────┴────────┴─────┴──────────┴───────┘
I have tried:
df = df.with_columns(value = pl.col(f"{pl.col.variable}"))
But that does not work because polars perceives the argument as a function (?). Does anyone know how to do this?
Note: I have also tried to transpose the dataframe, which, not only was that computationally expensive, also did not work! Because it would transpose the DF into a 5-rows-long DF. What I need is a (Name * Number of Variables)-rows-long DF.
That is, for example, I have 3 different names (say, Apple, Banana, and Dragonfruit), and I have 3 variables (Average, Median, Q1), then my DF should be 9-rows-long!
You can use
when/then() to check whether the value of the column Variable is the same as the column name.coalesce() to choose first non-empty result.df.with_columns(
    value = pl.coalesce(
        pl.when(pl.col.Variable == col).then(pl.col(col))
        for col in df["Variable"].unique()
    )
)
┌────────┬─────────┬────────┬─────┬──────────┬───────┐
│ Name   ┆ Average ┆ Median ┆ Q1  ┆ Variable ┆ value │
│ ---    ┆ ---     ┆ ---    ┆ --- ┆ ---      ┆ ---   │
│ str    ┆ i64     ┆ i64    ┆ i64 ┆ str      ┆ i64   │
╞════════╪═════════╪════════╪═════╪══════════╪═══════╡
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Average  ┆ 2     │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Median   ┆ 3     │
│ Apple  ┆ 2       ┆ 3      ┆ 4   ┆ Q1       ┆ 4     │
│ Banana ┆ 3       ┆ 5      ┆ 10  ┆ Average  ┆ 3     │
│ Banana ┆ 3       ┆ 5      ┆ 10  ┆ Median   ┆ 5     │
│ Banana ┆ 3       ┆ 5      ┆ 10  ┆ Q1       ┆ 10    │
└────────┴─────────┴────────┴─────┴──────────┴───────┘
You can leverage this functionality from polars.when() and polars.then() with a mapping.
For instance,
variable_columns = ["Q1", "Average", "Median"]
expressions = [
    pl.when(pl.col("Variable") == col_name).then(pl.col(col_name))
    for col_name in variable_columns
]
combined_expr = pl.fold(
    acc=pl.lit(None),
    function=lambda acc, x: pl.coalesce(acc, x),
    exprs=expressions
)
df = df.with_columns(combined_expr.alias("Value"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With