Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do an as-of-join in SQL (Snowflake)?

I am looking to join two time-ordered tables, such that the events in table1 are matched to the "next" event in table2 (within the same user). I am using SQL / Snowflake for this.

For argument's sake table1 is "notification_clicked" events and table2 is "purchases"

This is one way to do it:

WITH partial_result AS (
    SELECT 
    userId, notificationId, notificationTimeStamp, transactionId, transactionTimeStamp
    FROM table1 CROSS JOIN table2 
    WHERE table1.userId = table2.userId 
    AND notificationTimeStamp <= transactionTimeStamp)

SELECT * 
   FROM partial_result 
   QUALIFY ROW_NUMBER() OVER(
       PARTITION BY userId, notificationId ORDER BY transactionTimeStamp ASC
   ) = 1

It is not super readable, but is this "the" way to do this?

like image 653
MYK Avatar asked Sep 05 '25 03:09

MYK


1 Answers

Edit: Since writing this, Snowflake introduced support for AsOf join syntax:

https://docs.snowflake.com/en/sql-reference/constructs/asof-join

If you're doing an AsOf join against small tables, you can use a regular Venn diagram type of join. If you're running it against large tables, a regular join will lead to an intermediate cardinality explosion before the filter.

For large tables, this is the highest performance approach I have to date. Rather than treating an AsOf join like a regular Venn diagram join, we can treat it like a special type of union between two tables with a filter that uses the information from that union. The sample SQL does the following:

  • Unions the A and B tables so that the Entity and Time come from both tables and all other columns come from only one table. Rows from the other table specify NULL for these values (measures 1 and 2 in this case). It also projects a source column for the table. We'll use this later.
  • In the unioned table, it uses a LAG function on windows partitioned by the Entity and ordered by the Time. For each row with a source indicator from the A table, it lags back to the first Time with source in the B table, ignoring all values in the A table.
with A as 
(
    select 
    COLUMN1::int as "E",     -- Entity
    COLUMN2::int as "T",     -- Time
    COLUMN4::string as "M1"  -- Measure (could be many)
    from (values
        (1,  7, 1, 'M1-1'),
        (1,  8, 1, 'M1-2'),
        (1, 41, 1, 'M1-3'),
        (1, 89, 1, 'M1-4')
    )
), B as
(
    select 
    COLUMN1::int as "E",     -- Entity
    COLUMN2::int as "T",     -- Time
    COLUMN4::string as "M2"  -- Different measure (could be many)
    from (values
        (1,  6, 1, 'M2-1'),
        (1, 12, 1, 'M2-2'),
        (1, 20, 1, 'M2-3'),
        (1, 35, 1, 'M2-4'),
        (1, 57, 1, 'M2-5'),
        (1, 85, 1, 'M2-6'),
        (1, 92, 1, 'M2-7')
    )
), UNIONED as               -- Unify schemas and union all
(
select   'A'           as SOURCE_TABLE -- Project the source table
        ,E             as AB_E         -- AB_ means it's unified
        ,T             as AB_T
        ,M1            as A_M1         -- A_ means it's from A
        ,NULL::string  as B_M2         -- Make columns from B null for A
from A
    union all
select   'B'            as SOURCE_TABLE
        ,E              as AB_E
        ,T              as AB_T
        ,NULL::string   as A_M1 -- Make columns from A null for B
        ,M2             as B_M2
from B
)
select   AB_E   as ENTITY
        ,AB_T   as A_TIME
        ,lag(iff(SOURCE_TABLE = 'A', null, AB_T)) -- Lag back to  
           ignore nulls over                      -- previous B row
             (partition by AB_E order by AB_T) as B_TIME
        ,A_M1   as M1_FROM_A
        ,lag(B_M2)      -- Lag back to the previous non-null row.
           ignore nulls -- The A sourced rows will already be NULL.
             over (partition by AB_E order by AB_T) as M2_FROM_B
from UNIONED
qualify SOURCE_TABLE = 'A'
;

This will perform orders of magnitude faster for large tables because the highest intermediate cardinality is guaranteed to be the cardinality of A + B.

To simplify this refactor, I wrote a stored procedure that generates the SQL given the paths to table A and B, the entity column in A and B (right now limited to one, but if you have more it will get the SQL started), the order by (time) column in A and B, and finally the list of columns to "drag through" the AsOf join. It's rather lengthy so I posted it on Github and will work later to document and enhance it:

https://github.com/GregPavlik/AsOfJoin/blob/main/StoredProcedure.sql

like image 110
Greg Pavlik Avatar answered Sep 07 '25 22:09

Greg Pavlik