How do I select the longest string from a list of strings in polars?
Example and expected output:
import polars as pl

df = pl.DataFrame({
    "values": [
        ["the", "quickest", "brown", "fox"],
        ["jumps", "over", "the", "lazy", "dog"],
        []
    ]
})
┌──────────────────────────────┬────────────────┐
│ values ┆ longest_string │
│ --- ┆ --- │
│ list[str] ┆ str │
╞══════════════════════════════╪════════════════╡
│ ["the", "quickest", … "fox"] ┆ quickest │
│ ["jumps", "over", … "dog"] ┆ jumps │
│ [] ┆ null │
└──────────────────────────────┴────────────────┘
My use case is to select the longest overlapping match.
Edit: elaborating on the longest overlapping match, this is the output for the example provided by polars:
┌────────────┬───────────┬─────────────────────────────────┐
│ values ┆ matches ┆ matches_overlapping │
│ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] │
╞════════════╪═══════════╪═════════════════════════════════╡
│ discontent ┆ ["disco"] ┆ ["disco", "onte", "discontent"] │
└────────────┴───────────┴─────────────────────────────────┘
I want a way to select the longest match in matches_overlapping.
You can do something like:
df.with_columns(
    pl.col('values').list.get(
        pl.col('values')
        .list.eval(pl.element().str.len_chars())
        .list.arg_max()
    )
    .alias('longest_string')
)
This expression:
pl.col('values')
    .list.eval(pl.element().str.len_chars())
    .list.arg_max()
first maps len_chars to each string in each of the lists with .list.eval, then finds the arg_max (the index of the max element, so in this case, the index of the max length). The result of that is passed to list.get to retrieve those values.