I am trying to bin values so the data can later be fed into a plotting library.
For this I am trying to use polars' Expr.cut. The dataframe I operate on contains different groups of values, and each group should be binned using different breaks. Ideally, I would like to pass np.linspace(BinMin, BinMax, 50) as the breaks argument of Expr.cut.
I managed to create the BinMin and BinMax columns in the dataframe, but I can't work out how to use np.linspace to define the breaks dynamically for each row.
This is a minimal example of what I tried:
import numpy as np
import polars as pl
df = pl.DataFrame({"Value": [12], "BinMin": [0], "BinMax": [100]})
At this point the dataframe looks like:
┌───────┬────────┬────────┐
│ Value ┆ BinMin ┆ BinMax │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ i64    │
╞═══════╪════════╪════════╡
│ 12    ┆ 0      ┆ 100    │
└───────┴────────┴────────┘
And trying to use Expr.cut with dynamic breaks:
df.with_columns(pl.col("Value").cut(breaks=np.linspace(pl.col("BinMin"), pl.col("BinMax"), 50)).alias("Bin"))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.with_columns(pl.col("Value").cut(breaks=range(pl.col("BinMin"), pl.col("BinMax"))).alias("Bin"))
TypeError: 'Expr' object cannot be interpreted as an integer
I understand the error: np.linspace expects to be called with actual scalar numbers, not polars Expr objects. But I cannot figure out how to call it with dynamic breaks derived from the BinMin and BinMax columns.
Unfortunately, pl.Expr.cut doesn't support expressions for the breaks argument (yet), but requires a fixed sequence.
(This would be a good feature request though).
A naive solution that works for DataFrames, but bypasses polars' native expression API, is to use pl.Expr.map_elements together with NumPy's np.linspace and np.digitize.
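def my_cut(x, num=50):
    # x is one row as a dict with keys "Value", "BinMin", "BinMax"
    seq = np.linspace(x["BinMin"], x["BinMax"], num=num)
    # index of the first break strictly greater than the value
    idx = np.digitize(x["Value"], seq)
    # lower and upper edge of the bin containing the value
    return seq[idx-1:idx+1].tolist()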
(
    df
    .with_columns(
        pl.struct("Value", "BinMin", "BinMax").map_elements(my_cut).alias("Bin")
    )
)
shape: (1, 4)
┌───────┬────────┬────────┬────────────────────────┐
│ Value ┆ BinMin ┆ BinMax ┆ Bin                    │
│ ---   ┆ ---    ┆ ---    ┆ ---                    │
│ i64   ┆ i64    ┆ i64    ┆ list[f64]              │
╞═══════╪════════╪════════╪════════════════════════╡
│ 12    ┆ 0      ┆ 100    ┆ [10.204082, 12.244898] │
└───────┴────────┴────────┴────────────────────────┘
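To make the slicing in my_cut easier to follow, here is a small standalone sketch of what np.digitize returns for the example breaks. The variable names are illustrative only. It also shows an edge case worth keeping in mind: a value exactly equal to BinMax falls past the last break, so the slice returns a single edge rather than a pair.

```python
import numpy as np

# Same breaks as the example row: 50 evenly spaced points from 0 to 100.
seq = np.linspace(0, 100, num=50)

# A value inside the range: digitize returns i with seq[i-1] <= x < seq[i].
idx_inside = int(np.digitize(12, seq))
edges_inside = seq[idx_inside - 1 : idx_inside + 1].tolist()

# A value equal to the last break: digitize returns len(seq).
idx_top = int(np.digitize(100, seq))
edges_top = seq[idx_top - 1 : idx_top + 1].tolist()

print(idx_inside, edges_inside)  # 6 and the two edges around 12
print(idx_top, edges_top)        # 50 and a one-element list
```

Similarly, a value below BinMin gives index 0 and an empty slice, so data that can touch or leave the range may need clipping or an explicit check before binning.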