Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster Sequential Joins

In a regular Polars join where there are duplicates, the result is the cartesian product of the matched rows.

However, I would like to join the dataframes such that if there are duplicates, the rows are matched in a sequential manner.

Example below:

dfA = pl.from_repr('''
┌───────┬─────┐
│ group ┆ A   │
│ ---   ┆ --- │
│ str   ┆ i64 │
╞═══════╪═════╡
│ X     ┆ 0   │
│ X     ┆ 1   │
│ Y     ┆ 2   │
│ Y     ┆ 3   │
└───────┴─────┘
''')

dfB = pl.from_repr('''
┌───────┬─────┐
│ group ┆ B   │
│ ---   ┆ --- │
│ str   ┆ i64 │
╞═══════╪═════╡
│ X     ┆ 4   │
│ X     ┆ 5   │
│ Y     ┆ 6   │
│ Y     ┆ 7   │
└───────┴─────┘
''')
Current implementation: dfA.join(dfB, on='group')
┌───────┬─────┬─────┐
│ group ┆ A   ┆ B   │
│ ---   ┆ --- ┆ --- │
│ str   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ X     ┆ 0   ┆ 4   │
│ X     ┆ 0   ┆ 5   │
│ X     ┆ 1   ┆ 4   │
│ X     ┆ 1   ┆ 5   │
│ Y     ┆ 2   ┆ 6   │
│ Y     ┆ 2   ┆ 7   │
│ Y     ┆ 3   ┆ 6   │
│ Y     ┆ 3   ┆ 7   │
└───────┴─────┴─────┘
Desired outcome: Sequential Left Join on 'group'
┌───────┬─────┬─────┐
│ group ┆ A   ┆ B   │
│ ---   ┆ --- ┆ --- │
│ str   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ X     ┆ 0   ┆ 4   │
│ X     ┆ 1   ┆ 5   │
│ Y     ┆ 2   ┆ 6   │
│ Y     ┆ 3   ┆ 7   │
└───────┴─────┴─────┘

I'm not sure if there's already a term for such a join, so I have termed it a Sequential Join.

Currently, to achieve the desired outcome, I create an 'index' column on both dataframes then join on both the group and index.

(
    dfA
    .with_columns(
        index = pl.int_range(0, pl.len()).over('group')
    )
    .join(
        dfB
        .with_columns(
            index = pl.int_range(0, pl.len()).over('group')
        ),
        on = ['group', 'index']
    )
)

However, this can be very slow when joining on multiple columns or if there are many distinct groups.

For example, my main use case is for joining large time-series datasets with duplicated timestamps, so doing an pl.int_range(0, pl.len()).over('timestamp') can be very slow. Hence, I'm wondering if there is a faster or better way to do this.

Otherwise, is there a way to add custom join logic in Polars?

like image 397
T.H Rice Avatar asked Nov 18 '25 17:11

T.H Rice


1 Answers

This assumes that the lengths of the groups match (the results are ill-defined otherwise anyway without further clarification):

pl.concat(
    [
        dfA.sort('group', maintain_order=True),
        dfB.sort('group', maintain_order=True).drop('group'),
    ],
    how="horizontal"
)
like image 115
orlp Avatar answered Nov 20 '25 06:11

orlp



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!