Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delete row with negative data in Polars, if row with same value in one column and positive data exists

Lets assume I have this Polars df:

import polars as pl

df = pl.DataFrame({'name': ['a', 'a', 'b', 'b'],
                   'found': ['yes', 'no', 'no', 'no'],
                   'found_in_iteration': ['1', 'not_found', 'not_found', 'not_found']})

(This df was generated in two iterations and for a some data was found in the first, but not in the second and for b no data was found in any iteration.) No I want to keep only those rows, where data was found, but if for one name, not data was found at all, I want to keep this information that it was not found once. The latter is pretty easy doable with unique(), but how do I get rid of the a,no row, if an a,yes exists. The solution should be usable even with more iterations like here:

df = pl.DataFrame({'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'found': ['yes', 'no', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no'],
                   'found_in_iteration': ['1', 'not_found', '3', 'not_found', 'not_found', 'not_found', 'not_found', '2', 'not_found']})

Here, a was found in two iterations, b was found not at all and c was found in one. Here for a, both columns with found == yes should be kept, for c, the one column with found == yes should be kept and for b one column with found == no should be kept. So this:

shape: (9, 3)
┌──────┬───────┬────────────────────┐
│ name ┆ found ┆ found_in_iteration │
│ ---  ┆ ---   ┆ ---                │
│ str  ┆ str   ┆ str                │
╞══════╪═══════╪════════════════════╡
│ a    ┆ yes   ┆ 1                  │
│ a    ┆ no    ┆ not_found          │
│ a    ┆ yes   ┆ 3                  │
│ b    ┆ no    ┆ not_found          │
│ b    ┆ no    ┆ not_found          │
│ b    ┆ no    ┆ not_found          │
│ c    ┆ no    ┆ not_found          │
│ c    ┆ yes   ┆ 2                  │
│ c    ┆ no    ┆ not_found          │
└──────┴───────┴────────────────────┘

should be reduced to this:

shape: (9, 3)
┌──────┬───────┬────────────────────┐
│ name ┆ found ┆ found_in_iteration │
│ ---  ┆ ---   ┆ ---                │
│ str  ┆ str   ┆ str                │
╞══════╪═══════╪════════════════════╡
│ a    ┆ yes   ┆ 1                  │
│ a    ┆ yes   ┆ 3                  │
│ b    ┆ no    ┆ not_found          │
│ c    ┆ yes   ┆ 2                  │
└──────┴───────┴────────────────────┘
like image 774
gernophil Avatar asked Nov 01 '25 14:11

gernophil


1 Answers

Here's one approach:

out = (
    df.unique(maintain_order=True)
      .filter(
          pl.col("found").eq('yes') | pl.col("name").is_unique()
          )
      )

Output:

shape: (4, 3)
┌──────┬───────┬────────────────────┐
│ name ┆ found ┆ found_in_iteration │
│ ---  ┆ ---   ┆ ---                │
│ str  ┆ str   ┆ str                │
╞══════╪═══════╪════════════════════╡
│ a    ┆ yes   ┆ 1                  │
│ a    ┆ yes   ┆ 3                  │
│ b    ┆ no    ┆ not_found          │
│ c    ┆ yes   ┆ 2                  │
└──────┴───────┴────────────────────┘

Explanation

  • Use df.unique to drop duplicates (with maintain_order=True if important).
  • And chain .filter to keep rows where 'found' equals 'yes' (Expr.eq) or 'name' is unique (Expr.is_unique).
like image 197
ouroboros1 Avatar answered Nov 03 '25 04:11

ouroboros1



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!