In a regular Polars join where there are duplicates, the result is the cartesian product of the matched rows.
However, I would like to join the dataframes such that if there are duplicates, the rows are matched in a sequential manner.
Example below:
dfA = pl.from_repr('''
┌───────┬─────┐
│ group ┆ A │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═════╡
│ X ┆ 0 │
│ X ┆ 1 │
│ Y ┆ 2 │
│ Y ┆ 3 │
└───────┴─────┘
''')
dfB = pl.from_repr('''
┌───────┬─────┐
│ group ┆ B │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═════╡
│ X ┆ 4 │
│ X ┆ 5 │
│ Y ┆ 6 │
│ Y ┆ 7 │
└───────┴─────┘
''')
Current implementation: dfA.join(dfB, on='group')
┌───────┬─────┬─────┐
│ group ┆ A ┆ B │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ X ┆ 0 ┆ 4 │
│ X ┆ 0 ┆ 5 │
│ X ┆ 1 ┆ 4 │
│ X ┆ 1 ┆ 5 │
│ Y ┆ 2 ┆ 6 │
│ Y ┆ 2 ┆ 7 │
│ Y ┆ 3 ┆ 6 │
│ Y ┆ 3 ┆ 7 │
└───────┴─────┴─────┘
Desired outcome: Sequential Left Join on 'group'
┌───────┬─────┬─────┐
│ group ┆ A ┆ B │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ X ┆ 0 ┆ 4 │
│ X ┆ 1 ┆ 5 │
│ Y ┆ 2 ┆ 6 │
│ Y ┆ 3 ┆ 7 │
└───────┴─────┴─────┘
I'm not sure if there's already a term for such a join, so I have termed it a Sequential Join.
Currently, to achieve the desired outcome, I create an 'index' column on both dataframes then join on both the group and index.
(
dfA
.with_columns(
index = pl.int_range(0, pl.len()).over('group')
)
.join(
dfB
.with_columns(
index = pl.int_range(0, pl.len()).over('group')
),
on = ['group', 'index']
)
)
However, this can be very slow when joining on multiple columns or if there are many distinct groups.
For example, my main use case is for joining large time-series datasets with duplicated timestamps, so doing an pl.int_range(0, pl.len()).over('timestamp') can be very slow. Hence, I'm wondering if there is a faster or better way to do this.
Otherwise, is there a way to add custom join logic in Polars?
This assumes that the lengths of the groups match (the results are ill-defined otherwise anyway without further clarification):
pl.concat(
[
dfA.sort('group', maintain_order=True),
dfB.sort('group', maintain_order=True).drop('group'),
],
how="horizontal"
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With