I have a string column and I want to count how often each word occurs across all of its rows.
DataFrame example:
import polars as pl

df = pl.DataFrame({
    "Description": [
        "Would never order again.",
        "I'm not sure it gives me any type of glow and",
        "Goes on smoothly a bit sticky and color is glow",
        "Preferisco altri prodotti della stessa marca.",
        "The moisturizing advertised is non-existent."
    ]
})
If I were using pandas, I would use .str.split, stack, and value_counts:
pl.from_pandas(
    df.to_pandas().Description.str.split(expand=True)
    .stack()
    .value_counts()
    .reset_index()
)
shape: (33, 2)
┌───────────────┬───────┐
│ index         ┆ count │
│ ---           ┆ ---   │
│ str           ┆ i64   │
╞═══════════════╪═══════╡
│ and           ┆ 2     │
│ glow          ┆ 2     │
│ is            ┆ 2     │
│ Would         ┆ 1     │
│ altri         ┆ 1     │
│ …             ┆ …     │
│ not           ┆ 1     │
│ I'm           ┆ 1     │
│ again.        ┆ 1     │
│ order         ┆ 1     │
│ non-existent. ┆ 1     │
└───────────────┴───────┘
How would I do this using just Polars?
You can split each description on spaces, flatten the resulting lists into a single column of words, then group by word and count:
(
    df.select(pl.col("Description").str.split(" ").flatten().alias("words"))
    .group_by("words")
    .len()
    .sort("len", descending=True)
    .filter(pl.col("words").str.len_chars() > 0)
)
shape: (33, 2)
┌───────────────┬─────┐
│ words         ┆ len │
│ ---           ┆ --- │
│ str           ┆ u32 │
╞═══════════════╪═════╡
│ is            ┆ 2   │
│ and           ┆ 2   │
│ glow          ┆ 2   │
│ me            ┆ 1   │
│ of            ┆ 1   │
│ …             ┆ …   │
│ it            ┆ 1   │
│ The           ┆ 1   │
│ Would         ┆ 1   │
│ non-existent. ┆ 1   │
│ type          ┆ 1   │
└───────────────┴─────┘