I have a dataset i'm trying to remove duplicate entries from. The lazyframe i'm working with is structured like this:
df = pl.from_repr("""
┌──────┬────────────┬──────────────────┬───────┐
│ id   ┆ title      ┆ type             ┆ type2 │
│ ---  ┆ ---        ┆ ---              ┆ ---   │
│ i64  ┆ str        ┆ str              ┆ i64   │
╞══════╪════════════╪══════════════════╪═══════╡
│ 1001 ┆ Research A ┆ journal article  ┆ 35    │
│ 1002 ┆ Research B ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research C ┆ journal article  ┆ 35    │
│ 1004 ┆ Research D ┆ conference paper ┆ 42    │
│ 1001 ┆ Research E ┆ journal article  ┆ 35    │
│ 1002 ┆ Research F ┆ journal article  ┆ 41    │
│ 1003 ┆ Research G ┆                  ┆ 41    │
│ 1002 ┆ Research I ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research J ┆ journal article  ┆ 35    │
└──────┴────────────┴──────────────────┴───────┘
""")
I want to remove entries that have the same id, but there are actually different cases:
type (e.g. 1001): keep the first one.type: discard the ones with an empty string ("") as type, and then keep only the entries that respect the following pairs of type and type2:dict_df = pl.DataFrame({
    "type": ['journal article', 'book chapter', 'book chapter'],
    "type2": [35, 41, 42]
})
Expected output
| id[i64] | title[str] | type[str] | type2[i64] | 
|---|---|---|---|
| 1001 | Research A | journal article | 35 | 
| 1002 | Research B | book chapter | 41 | 
| 1003 | Research C | journal article | 35 | 
| 1004 | Research D | conference paper | 42 | 
I've tried many things, mainly using the pl.when() expression, but I cannot work out a way to filter the groups.
(
    df
    .sort('type', descending=True)
    .group_by("id")
    .agg([
        pl.when(pl.col("type").n_unique() > 1)
        .then( ... )
        .otherwise(pl.all().first())
    ])
)
preference = pl.DataFrame({
    "type": ["journal article", "book chapter", "book chapter"],
    "iris_type": [35, 41, 42],
    "preference": [0, 1, 2]
})
joined = df.lazy().join(preference.lazy(), on=["type", "iris_type"], how="left")
out = (
    joined.sort("preference", descending=True, nulls_last=True)
        .group_by("id")
        .first()
        .drop("preference")
        .collect()
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With