
Multi filter by 2 columns and display largest results with Polars

I have a DataFrame with three main columns, cid1, cid2 and cid3, plus further columns cid4, cid5, etc. cid1 and cid2 are integers; the other columns are floats.

import polars as pl

df = pl.from_repr("""
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 1.0  ┆ 4.0  ┆ 4.0  ┆ 1.0  │
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 1.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘
""")

Each combination of cid1 and cid2 is a workset for analysis, and each workset has several cid3 values.

I can keep only the rows with the maximal cid3 value in each workset:

df.filter(pl.col("cid3") == pl.col("cid3").max().over("cid1", "cid2"))
shape: (2, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

But I would like to keep the rows with the two largest cid3 values in each workset, giving this result:

shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

(Two largest values of cid3 is just an example; in my actual task I want the 10 largest and the 5 smallest values.)

Jahspear asked Jan 18 '26 00:01



2 Answers

Here is one more option, which works for both the largest and the smallest values. It selects by distinct value, so rows that tie on cid3 are all kept.

Getting 2 largest values

df.filter(
    pl.col("cid3").is_in(pl.col("cid3").unique().sort(descending=True).head(2))
    .over("cid1", "cid2")
    )
shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

Getting 2 smallest values

df.filter(
    pl.col("cid3").is_in(pl.col("cid3").unique().sort(descending=False).head(2))
    .over("cid1", "cid2")
    )
shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 1.0  ┆ 4.0  ┆ 4.0  ┆ 1.0  │
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 1.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘
Luca answered Jan 20 '26 14:01



You can use .top_k() to get the k largest (or smallest) values.

.unique().top_k() can be used if you need distinct values.

df.group_by("cid1", "cid2").agg(pl.col("cid3").top_k(2))
shape: (2, 3)
┌──────┬──────┬────────────┐
│ cid1 ┆ cid2 ┆ cid3       │
│ ---  ┆ ---  ┆ ---        │
│ i64  ┆ i64  ┆ list[f64]  │
╞══════╪══════╪════════════╡
│ 1    ┆ 5    ┆ [9.0, 2.0] │
│ 3    ┆ 7    ┆ [8.0, 3.0] │
└──────┴──────┴────────────┘

To keep the full rows, the same expression can be used inside .filter() combined with .is_in():

df.filter(
   pl.col("cid3").is_in(pl.col("cid3").top_k(2))
     .over("cid1", "cid2")
)
shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

When this answer was first written, passing descending=True to .top_k() returned the minimal values instead.

Update: .bottom_k() has since been added and does this directly:

df.filter(
   pl.col("cid3").is_in(pl.col("cid3").bottom_k(2))
     .over("cid1", "cid2")
)
shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 1.0  ┆ 4.0  ┆ 4.0  ┆ 1.0  │
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 1.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

Dataframe used:

df = pl.read_csv(b"""
cid1,cid2,cid3,cid4,cid5,cid6
1,5,1.0,4.0,4.0,1.0
1,5,2.0,5.0,5.0,9.0
1,5,9.0,6.0,4.0,9.0
3,7,1.0,7.0,9.0,1.0
3,7,3.0,7.0,9.0,1.0
3,7,8.0,8.0,3.0,1.0
""")
jqurious answered Jan 20 '26 14:01



