Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keep only duplicated rows with a subset

I have a dataframe that I'd like to explore and look only the duplicated rows based on two or more columns.

For example:

df = pl.DataFrame({"A": [1, 6, 5, 4, 5, 6],
                   "B": ["A", "B", "C", "D", "C", "A"],
                   "C": [2, 2, 2, 1, 1, 1]})

I'd like to return duplicate combination for columns A and B only. I've tried:

df.filter(pl.col("A", "B").is_duplicated()) # Returns: This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

When adding .all() in between, the result is the same as above

df.filter(pl.col("A", "B").all().is_duplicated()) # Same as above

Unique with keep "none" returns the opposite result I'd like to have, so tried the below:

df.unique(subset=["A", "B"], keep="none").is_not() # 'DataFrame' object has no attribute 'is_not'

Expected output would be to see only the rows:

shape: (2, 3)
┌─────┬─────┬─────┐
│ A   | B   | C   │
│ --- | --- | --- │
│ i64 | str | i64 │
╞═════╪═════╪═════╡
│ 5   | C   | 2   │
│ 5   | C   | 1   │
└─────┴─────┴─────┘
like image 930
miroslaavi Avatar asked Oct 29 '25 19:10

miroslaavi


2 Answers

A struct allows you to "combine" multiple expressions together as a "single entity" (or "key").

Which you can use .is_duplicated() on.

df.filter(pl.struct("A", "B").is_duplicated())
shape: (2, 3)
┌─────┬─────┬─────┐
│ A   | B   | C   │
│ --- | --- | --- │
│ i64 | str | i64 │
╞═════╪═════╪═════╡
│ 5   | C   | 2   │
│ 5   | C   | 1   │
└─────┴─────┴─────┘

Why your approaches fail

We can use LazyFrame.explain() to inspect the query plan.

print(
   df.lazy().select(pl.col("A", "B").is_duplicated()).explain()
)
 SELECT [col("A").is_duplicated(), col("B").is_duplicated()] FROM
  DF ["A", "B", "C"]; PROJECT 2/3 COLUMNS; SELECTION: None

pl.col("A", "B").is_duplicated() expands in to 2 separate calls.

pl.col("A").is_duplicated()
pl.col("B").is_duplicated()

When you create a struct, .is_duplicated() is only called once.

df.select(pl.struct("A", "B"))
shape: (6, 1)
┌───────────┐
│ A         │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {1,"A"}   │
│ {6,"B"}   │
│ {5,"C"}   │
│ {4,"D"}   │
│ {5,"C"}   │
│ {6,"A"}   │
└───────────┘
like image 131
jqurious Avatar answered Oct 31 '25 09:10

jqurious


In the .is_duplicated() documentation we have the following example:

Examples

df = pl.DataFrame(
    {
        "a": [1, 2, 3, 1],
        "b": ["x", "y", "z", "x"],
    }
)
df.is_duplicated()
shape: (4,)
Series: '' [bool]
[
        true
        false
        false
        true
]

This mask can be used to visualize the duplicated lines like this:

df.filter(df.is_duplicated())
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ x   │
│ 1   ┆ x   │
└─────┴─────┘
like image 35
Igor Marcos Riegel Avatar answered Oct 31 '25 09:10

Igor Marcos Riegel