The breakpoints data is the following:
breakpoints = pl.DataFrame(
    {
        "features": ["feature_0", "feature_0", "feature_1"],
        "breakpoints": [0.1, 0.5, 1],
        "n_possible_bins": [3, 3, 2],
    }
)
print(breakpoints)
out:
shape: (3, 3)
┌───────────┬─────────────┬─────────────────┐
│ features  ┆ breakpoints ┆ n_possible_bins │
│ ---       ┆ ---         ┆ ---             │
│ str       ┆ f64         ┆ i64             │
╞═══════════╪═════════════╪═════════════════╡
│ feature_0 ┆ 0.1         ┆ 3               │
│ feature_0 ┆ 0.5         ┆ 3               │
│ feature_1 ┆ 1.0         ┆ 2               │
└───────────┴─────────────┴─────────────────┘
The df has two continuous variables that we wish to encode according to the breakpoints DataFrame:
df = pl.DataFrame(
    {"feature_0": [0.05, 0.2, 0.6, 0.8], "feature_1": [0.5, 1.5, 1.0, 1.1]}
)
print(df)
out:
shape: (4, 2)
┌───────────┬───────────┐
│ feature_0 ┆ feature_1 │
│ ---       ┆ ---       │
│ f64       ┆ f64       │
╞═══════════╪═══════════╡
│ 0.05      ┆ 0.5       │
│ 0.2       ┆ 1.5       │
│ 0.6       ┆ 1.0       │
│ 0.8       ┆ 1.1       │
└───────────┴───────────┘
After the encoding we should have the resulting DataFrame encoded_df:
encoded_df = pl.DataFrame({"feature_0": [0, 1, 2, 2], "feature_1": [0, 1, 0, 1]})
print(encoded_df)
out:
shape: (4, 2)
┌───────────┬───────────┐
│ feature_0 ┆ feature_1 │
│ ---       ┆ ---       │
│ i64       ┆ i64       │
╞═══════════╪═══════════╡
│ 0         ┆ 0         │
│ 1         ┆ 1         │
│ 2         ┆ 0         │
│ 2         ┆ 1         │
└───────────┴───────────┘
The labels for encoded_df are also available in breakpoints: np.array([str(i) for i in range(n_possible_bins)]), assuming n_possible_bins is a positive integer. n_possible_bins may be different across features.
The bins use left_closed=False, i.e. they are defined as (breakpoint, next breakpoint].
I know that Polars' Expr.cut() takes its breaks parameter as a Sequence[float], but how do I pass in these breakpoints and labels from the breakpoints DataFrame effectively?
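To make the bin rule concrete, here is a minimal plain-Python sketch of the (breakpoint, next breakpoint] semantics, independent of Polars. The helper name encode is hypothetical and only used for illustration; it relies on bisect_left from the standard library, which returns the number of breakpoints strictly below a value, so a value equal to a breakpoint stays in the lower bin.

```python
from bisect import bisect_left


def encode(values, breaks):
    """Hypothetical helper: map each value to its right-closed bin index."""
    breaks = sorted(breaks)
    # bisect_left counts the breakpoints strictly smaller than v,
    # so v == breakpoint falls into the bin that ends at that breakpoint.
    return [bisect_left(breaks, v) for v in values]


feature_0 = encode([0.05, 0.2, 0.6, 0.8], [0.1, 0.5])
feature_1 = encode([0.5, 1.5, 1.0, 1.1], [1.0])
print(feature_0)  # [0, 1, 2, 2]
print(feature_1)  # [0, 1, 0, 1]
```

These values match the expected encoded_df above, including 1.0 in feature_1 landing in bin 0.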
Given that breakpoints will most likely be a very small DataFrame, I think the simplest and most efficient solution is something like:
import polars as pl

breakpoints = pl.DataFrame(
    {
        "features": ["feature_0", "feature_0", "feature_1"],
        "breakpoints": [0.1, 0.5, 1],
        "n_possible_bins": [3, 3, 2],
    }
)
df = pl.DataFrame(
    {"feature_0": [0.05, 0.2, 0.6, 0.8], "feature_1": [0.5, 1.5, 1.0, 1.1]}
)

# Aggregate the breakpoints by feature; maintain_order keeps the features
# in their order of first appearance (group_by does not guarantee it otherwise)
feature_breaks = breakpoints.group_by("features", maintain_order=True).agg(
    pl.col("breakpoints").sort().alias("breaks")
)

# For each feature, call `Expr.cut` with the respective `breaks`;
# k breakpoints produce k + 1 bins, labelled "0" .. "k"
result = df.select(
    pl.col(feat).cut(breaks, labels=[str(x) for x in range(len(breaks) + 1)])
    for feat, breaks in feature_breaks.iter_rows()
)
Output:
>>> feature_breaks
shape: (2, 2)
┌───────────┬────────────┐
│ features  ┆ breaks     │
│ ---       ┆ ---        │
│ str       ┆ list[f64]  │
╞═══════════╪════════════╡
│ feature_0 ┆ [0.1, 0.5] │
│ feature_1 ┆ [1.0]      │
└───────────┴────────────┘
>>> result
shape: (4, 2)
┌───────────┬───────────┐
│ feature_0 ┆ feature_1 │
│ ---       ┆ ---       │
│ cat       ┆ cat       │
╞═══════════╪═══════════╡
│ 0         ┆ 0         │
│ 1         ┆ 1         │
│ 2         ┆ 0         │
│ 2         ┆ 1         │
└───────────┴───────────┘
Going purely with Polars operations for the entire process, you can convert the breakpoints into a series of ranges, then use join_where to match each value to the range where previous breakpoint < value <= next breakpoint:
import polars as pl

df = pl.DataFrame(
    {"feature_0": [0.05, 0.2, 0.6, 0.8], "feature_1": [0.5, 1.5, 1.0, 1.1]}
)
breakpoints = pl.DataFrame(
    {
        "features": ["feature_0", "feature_0", "feature_1"],
        "breakpoints": [0.1, 0.5, 1],
        "n_possible_bins": [3, 3, 2],
    }
)

# Aggregate the (sorted) breakpoints into lists and append -inf, +inf to the edges
points = (
    breakpoints.group_by('features').agg(pl.col('breakpoints').sort())
    .with_columns(
        pl.concat_list(
            pl.lit(float('-inf')),
            pl.col('breakpoints'),
            pl.lit(float('inf')),
        ).alias('breakpoints')
    )
)

# Turn that into one row for each cut
size = pl.col('breakpoints').list.len() - 1
intervals = points.select(
    pl.col('features'),
    pl.col('breakpoints').list.head(size).alias('min'),
    pl.col('breakpoints').list.tail(size).alias('max'),
    pl.int_ranges(0, size).alias('idx'),
).explode('min', 'max', 'idx')

# Now, you *could* use a loop to join for each column instead, but going full polars...
# Melt the df such that we can perform the actual join in a single operation
melted = df.with_row_index('idx').unpivot(index='idx', variable_name='feature')

# Join based on the ranges
joined = melted.join_where(
    intervals.rename({'features': 'feature', 'idx': 'encoded'}),
    pl.col('feature').eq(pl.col('feature_right')),
    # Change `closed=...` if you want
    pl.col('value').is_between(pl.col('min'), pl.col('max'), closed='right')
)

# Return to the original format; the join does not guarantee row or column order, so restore both
result = joined.pivot('feature', index='idx', values='encoded').sort('idx')
print(result.select(df.columns))