I am looking to join two time-ordered tables, such that the events in table1 are matched to the "next" event in table2 (within the same user). I am using SQL / Snowflake for this.
For argument's sake table1 is "notification_clicked" events and table2 is "purchases"
This is one way to do it:
WITH partial_result AS (
    SELECT 
    userId, notificationId, notificationTimeStamp, transactionId, transactionTimeStamp
    FROM table1 CROSS JOIN table2 
    WHERE table1.userId = table2.userId 
    AND notificationTimeStamp <= transactionTimeStamp)
SELECT * 
   FROM partial_result 
   QUALIFY ROW_NUMBER() OVER(
       PARTITION BY userId, notificationId ORDER BY transactionTimeStamp ASC
   ) = 1
It is not super readable, but is this "the" way to do this?
Edit: Since writing this, Snowflake introduced support for AsOf join syntax:
https://docs.snowflake.com/en/sql-reference/constructs/asof-join
If you're doing an AsOf join against small tables, you can use a regular Venn diagram type of join. If you're running it against large tables, a regular join will lead to an intermediate cardinality explosion before the filter.
For large tables, this is the highest performance approach I have to date. Rather than treating an AsOf join like a regular Venn diagram join, we can treat it like a special type of union between two tables with a filter that uses the information from that union. The sample SQL does the following:
with A as 
(
    select 
    COLUMN1::int as "E",     -- Entity
    COLUMN2::int as "T",     -- Time
    COLUMN4::string as "M1"  -- Measure (could be many)
    from (values
        (1,  7, 1, 'M1-1'),
        (1,  8, 1, 'M1-2'),
        (1, 41, 1, 'M1-3'),
        (1, 89, 1, 'M1-4')
    )
), B as
(
    select 
    COLUMN1::int as "E",     -- Entity
    COLUMN2::int as "T",     -- Time
    COLUMN4::string as "M2"  -- Different measure (could be many)
    from (values
        (1,  6, 1, 'M2-1'),
        (1, 12, 1, 'M2-2'),
        (1, 20, 1, 'M2-3'),
        (1, 35, 1, 'M2-4'),
        (1, 57, 1, 'M2-5'),
        (1, 85, 1, 'M2-6'),
        (1, 92, 1, 'M2-7')
    )
), UNIONED as               -- Unify schemas and union all
(
select   'A'           as SOURCE_TABLE -- Project the source table
        ,E             as AB_E         -- AB_ means it's unified
        ,T             as AB_T
        ,M1            as A_M1         -- A_ means it's from A
        ,NULL::string  as B_M2         -- Make columns from B null for A
from A
    union all
select   'B'            as SOURCE_TABLE
        ,E              as AB_E
        ,T              as AB_T
        ,NULL::string   as A_M1 -- Make columns from A null for B
        ,M2             as B_M2
from B
)
select   AB_E   as ENTITY
        ,AB_T   as A_TIME
        ,lag(iff(SOURCE_TABLE = 'A', null, AB_T)) -- Lag back to  
           ignore nulls over                      -- previous B row
             (partition by AB_E order by AB_T) as B_TIME
        ,A_M1   as M1_FROM_A
        ,lag(B_M2)      -- Lag back to the previous non-null row.
           ignore nulls -- The A sourced rows will already be NULL.
             over (partition by AB_E order by AB_T) as M2_FROM_B
from UNIONED
qualify SOURCE_TABLE = 'A'
;
This will perform orders of magnitude faster for large tables because the highest intermediate cardinality is guaranteed to be the cardinality of A + B.
To simplify this refactor, I wrote a stored procedure that generates the SQL given the paths to table A and B, the entity column in A and B (right now limited to one, but if you have more it will get the SQL started), the order by (time) column in A and B, and finally the list of columns to "drag through" the AsOf join. It's rather lengthy so I posted it on Github and will work later to document and enhance it:
https://github.com/GregPavlik/AsOfJoin/blob/main/StoredProcedure.sql
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With