Lets assume I have this Polars df:
import polars as pl
df = pl.DataFrame({'name': ['a', 'a', 'b', 'b'],
'found': ['yes', 'no', 'no', 'no'],
'found_in_iteration': ['1', 'not_found', 'not_found', 'not_found']})
(This df was generated in two iterations and for a some data was found in the first, but not in the second and for b no data was found in any iteration.) No I want to keep only those rows, where data was found, but if for one name, not data was found at all, I want to keep this information that it was not found once. The latter is pretty easy doable with unique(), but how do I get rid of the a,no row, if an a,yes exists. The solution should be usable even with more iterations like here:
df = pl.DataFrame({'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
'found': ['yes', 'no', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no'],
'found_in_iteration': ['1', 'not_found', '3', 'not_found', 'not_found', 'not_found', 'not_found', '2', 'not_found']})
Here, a was found in two iterations, b was found not at all and c was found in one. Here for a, both columns with found == yes should be kept, for c, the one column with found == yes should be kept and for b one column with found == no should be kept. So this:
shape: (9, 3)
┌──────┬───────┬────────────────────┐
│ name ┆ found ┆ found_in_iteration │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═══════╪════════════════════╡
│ a ┆ yes ┆ 1 │
│ a ┆ no ┆ not_found │
│ a ┆ yes ┆ 3 │
│ b ┆ no ┆ not_found │
│ b ┆ no ┆ not_found │
│ b ┆ no ┆ not_found │
│ c ┆ no ┆ not_found │
│ c ┆ yes ┆ 2 │
│ c ┆ no ┆ not_found │
└──────┴───────┴────────────────────┘
should be reduced to this:
shape: (9, 3)
┌──────┬───────┬────────────────────┐
│ name ┆ found ┆ found_in_iteration │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═══════╪════════════════════╡
│ a ┆ yes ┆ 1 │
│ a ┆ yes ┆ 3 │
│ b ┆ no ┆ not_found │
│ c ┆ yes ┆ 2 │
└──────┴───────┴────────────────────┘
Here's one approach:
out = (
df.unique(maintain_order=True)
.filter(
pl.col("found").eq('yes') | pl.col("name").is_unique()
)
)
Output:
shape: (4, 3)
┌──────┬───────┬────────────────────┐
│ name ┆ found ┆ found_in_iteration │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═══════╪════════════════════╡
│ a ┆ yes ┆ 1 │
│ a ┆ yes ┆ 3 │
│ b ┆ no ┆ not_found │
│ c ┆ yes ┆ 2 │
└──────┴───────┴────────────────────┘
Explanation
df.unique to drop duplicates (with maintain_order=True if important)..filter to keep rows where 'found' equals 'yes' (Expr.eq) or 'name' is unique (Expr.is_unique).If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With