Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting pandas df.duplicated function into Polars equivalent

This is the code , here i'm calculating a column named Total_SKU_Count, i want the column value as 1 for first occurrences of unique subset else 0.

Let's consider this dummy data set,

import pandas as pd
    
data = {
    'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
    'item': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X'],
    'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
    'Sold_Qty_LY': [10, 20, 15, 25, 30, 40, 10, 15, 20],
}

df = pd.DataFrame(data)

Now for the below subset

subset = ['store', 'item', 'date']
df['Total_SKU_Count'] = (~df.duplicated(subset=subset, keep='first')).astype(int)

This is the output

  store item        date  Sold_Qty_LY  Total_SKU_Count
0     A    X  2023-01-01           10                1
1     A    Y  2023-01-01           20                1
2     B    X  2023-01-01           15                1
3     B    Y  2023-01-01           25                1
4     C    X  2023-01-01           30                1
5     C    Y  2023-01-01           40                1
6     A    X  2023-01-02           10                1
7     B    Y  2023-01-02           15                1
8     C    X  2023-01-02           20                1

For this one,

subset = ['store', 'item']
df['Total_SKU_Count'] = (~df.duplicated(subset=subset, keep='first')).astype(int)

This is the output

  store item        date  Sold_Qty_LY  Total_SKU_Count
0     A    X  2023-01-01           10                1
1     A    Y  2023-01-01           20                1
2     B    X  2023-01-01           15                1
3     B    Y  2023-01-01           25                1
4     C    X  2023-01-01           30                1
5     C    Y  2023-01-01           40                1
6     A    X  2023-01-02           10                0
7     B    Y  2023-01-02           15                0
8     C    X  2023-01-02           20                0

If you observe the later output you can clearly understand the problem.

I've found this solution on stackoverflow:

  • Polars - drop duplicate row based on column subset but keep first

But in my case I can't drop any rows, and this is not useful for me. Since polars is new for me I'm having a hard time to solve this problem. Please help me if there is a way to achieve this. Your support is much appreciated.

like image 356
Akshay Avatar asked Nov 28 '25 17:11

Akshay


1 Answers

All you need is

dfpl.with_columns(Total_SKU_Count=pl.struct('store', 'item').is_first_distinct())

What this does:

  1. create a Struct column from 'store' and 'item'
  2. use is_first_distinct on that column

If you want your exact output, you may want to cast to pl.Int32 as well, i.e.

(
    dfpl.with_columns(
        Total_SKU_Count=pl.struct("store", "item").is_first_distinct().cast(pl.Int32)
    )
)
shape: (9, 5)
┌───────┬──────┬────────────┬─────────────┬─────────────────┐
│ store ┆ item ┆ date       ┆ Sold_Qty_LY ┆ Total_SKU_Count │
│ ---   ┆ ---  ┆ ---        ┆ ---         ┆ ---             │
│ str   ┆ str  ┆ str        ┆ i64         ┆ i32             │
╞═══════╪══════╪════════════╪═════════════╪═════════════════╡
│ A     ┆ X    ┆ 2023-01-01 ┆ 10          ┆ 1               │
│ A     ┆ Y    ┆ 2023-01-01 ┆ 20          ┆ 1               │
│ B     ┆ X    ┆ 2023-01-01 ┆ 15          ┆ 1               │
│ B     ┆ Y    ┆ 2023-01-01 ┆ 25          ┆ 1               │
│ C     ┆ X    ┆ 2023-01-01 ┆ 30          ┆ 1               │
│ C     ┆ Y    ┆ 2023-01-01 ┆ 40          ┆ 1               │
│ A     ┆ X    ┆ 2023-01-02 ┆ 10          ┆ 0               │
│ B     ┆ Y    ┆ 2023-01-02 ┆ 15          ┆ 0               │
│ C     ┆ X    ┆ 2023-01-02 ┆ 20          ┆ 0               │
└───────┴──────┴────────────┴─────────────┴─────────────────┘
like image 151
ignoring_gravity Avatar answered Dec 01 '25 08:12

ignoring_gravity