Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Time-based record grouping in SQL

I have a database table of user interactions. I would like to create groups of users based on the time & place of interaction. That is, if users are interacting at roughly the same time (e.g., 2 minute window) in the same location, I would consider them a group. The groups do not need to be mutually exclusive, but they do need to be exhaustive. Every user interaction belongs in one or more groups.

I've done something similar to this in the past with python and a disjoint set. But now I am limited to a SQL solution.

Assume a toy data table like

create table example_intxns (
  intxn_date DATE, 
  loc_id number,
  intxn_timestamp timestamp,
  user_id varchar(100)
);

insert into example_intxns (intxn_date, loc_id, intxn_timestamp, user_id)
values
('2021-01-01', 1, '2021-01-01 08:00:00', 'a'),
('2021-01-01', 1, '2021-01-01 08:01:00', 'b'),
('2021-01-01', 1, '2021-01-01 08:02:00', 'c'),
('2021-01-01', 1, '2021-01-01 08:04:00', 'd'),
('2021-01-01', 1, '2021-01-01 08:05:00', 'e'),
('2021-01-01', 1, '2021-01-01 08:07:00', 'f'),
('2021-01-01', 1, '2021-01-01 08:10:00', 'g'),
('2021-01-01', 1, '2021-01-01 08:02:00', 'h')
;

I can create pairs of users who interact within 2 minutes of each other like so

select distinct 
    a.intxn_date, 
    a.loc_id,    
    a.user_id as seed_user_id, 
    b.user_id

from 
    example_intxns a
inner join 
    example_intxns b
        on a.intxn_date = b.intxn_date
        and a.loc_id = b.loc_id
        and timestampdiff(minute, a.intxn_timestamp, b.intxn_timestamp) between -2 and 2
        and b.user

This returns a result set of all user pairs based on the same location and an interaction window of +/- 2 minutes. A user will always pair with themself- that's useful for when there is no one else interacting at the same time & place. And a user can pair with another user. Here, I use the b.user_id >= a.user_id condition because A->B is equivalent to B->A. I don't need both.

But here is where I'm stuck. I don't know how to extend pairs to groups in SQL-way. It feels like it could be a recursive problem?

I think what I want- and I'm still thinking through edge cases- is a result set that looks something like

Date Location GroupID GroupMember
2021-01-01 1 1 a
2021-01-01 1 1 b
2021-01-01 1 1 c
2021-01-01 1 1 h
2021-01-01 1 2 c
2021-01-01 1 2 h
2021-01-01 1 2 d
2021-01-01 1 3 d
2021-01-01 1 3 e
2021-01-01 1 4 e
2021-01-01 1 4 f
2021-01-01 1 5 g

This effectively tells me that on this date, at this location, I have 5 groups: [a,b,c,h], [c,h,d], [d,e], [e,f] and [g]

Complications for a solution include the same user interacting multiple times on the same day in the same location. Not present in the toy example, but possible in real data.

I'm working in Snowflake, if that makes a difference. The real world problem includes 10s of thousands of users every day, at a half-dozen locations, and ~5 years interactions.

EDIT: I moved the goal posts a bit on the description of what I wanted. My desired output reflected the need to exclude group occurrences that were a subset of an earlier group. That is, [b,c] is a valid group, but it's covered by the [a,b,c,h] group. But I didn't specifically call this criteria out. My final solution, which covers this additional requirement, is as follows:

with pairs as ( -- qualified pairs
select distinct 
    a.intxn_date, 
    a.loc_id,    
    a.user_id as seed_user_id,
    a.intxn_timestamp as seed_intxn_timestamp,
    b.user_id,
    b.intxn_timestamp as intxn_timestamp,
    dense_rank() over (partition by a.intxn_date, a.loc_id order by seed_user_id) as group_id

from 
    example_entries a
inner join 
    example_entries b
        on a.intxn_date = b.intxn_date
        and a.loc_id = b.loc_id
        and timestampdiff(second, a.intxn_timestamp, b.intxn_timestamp) between 0 and 120
        and b.user_id>=a.user_id
),
groups as ( -- qualified pairs converted into groups
select p.*,
    max(p.intxn_timestamp)over(partition by p.group_id) as max_group_intxn_timestamp
from pairs p
),
subtracts as ( -- groups already completely assumed in another, earlier gruop
select a.*
from groups a -- what we want to subtract
inner join groups b
  on a.seed_user_id = b.user_id
  and a.loc_id = b.loc_id
  and a.intxn_date = b.intxn_date
  and a.group_id > b.group_id
  and timestampdiff(second, a.seed_intxn_timestamp, b.intxn_timestamp) between 0 and 120
inner join groups c
  on b.loc_id = c.loc_id
  and b.intxn_date = c.intxn_date
  and b.seed_intxn_timestamp = c.seed_intxn_timestamp
  and b.group_id = c.group_id
  and c.user_id >= b.user_id
  and b.seed_user_id = c.seed_user_id
  and a.user_id = c.user_id
  and a.max_group_intxn_timestamp <= c.max_group_intxn_timestamp
)
select distinct a.intxn_date, a.loc_id, a.group_id, a.user_id, a.seed_intxn_timestamp as group_intxn_window_start_timestamp, a.max_group_intxn_timestamp as group_intxn_window_end_timestamp
from groups a
left join (select distinct intxn_date, loc_id, seed_intxn_timestamp, group_id from subtracts) b
on a.intxn_date = b.intxn_date
and a.loc_id = b.loc_id
and a.seed_intxn_timestamp = b.seed_intxn_timestamp
and a.group_id = b.group_id
where b.group_id is null
/*minus -- could have used a MINUS, but I think the LEFT JOIN way is safer because it removes all traces of the unqualified group
select distinct intxn_date, loc_id, group_id, user_id, seed_intxn_timestamp
from subtracts a*/
order by 1,2,3,4
like image 837
Amw 5G Avatar asked Sep 14 '25 19:09

Amw 5G


1 Answers

I modified your query a little, and then got group ids using RANK():

select *, rank() over(partition by intxn_date order by seed_timestamp) grp_id
from (
    select a.intxn_date, a.loc_id, a.intxn_timestamp as seed_timestamp, b.user_id
    from  example_entries a
    inner join example_entries b
    on a.intxn_date = b.intxn_date
    and a.loc_id = b.loc_id
    and timestampdiff(minute, a.intxn_timestamp, b.intxn_timestamp) between 0 and 2
    and iff(a.intxn_timestamp=b.intxn_timestamp, a.user_id>=b.user_id, true)
)

enter image description here

The group ids are non-consecutive, but you can see that there are 7 different groups, with each group having at least one element. Some rows belong to many groups, but there's no row without a group.

like image 152
Felipe Hoffa Avatar answered Sep 17 '25 11:09

Felipe Hoffa