I track web visitors. I store the IP address as well as the timestamp of the visit.
ip_address    time_stamp
180.2.79.3  1301654105
180.2.79.3  1301654106
180.2.79.3  1301654354
180.2.79.3  1301654356
180.2.79.3  1301654358
180.2.79.3  1301654366
180.2.79.3  1301654368
180.2.79.3  1301654422
I have a query to get total tracks:
SELECT COUNT(*) AS tracks FROM tracking
However, I now want to disregard visits from users that have visited multiple times within 10 seconds of each visit. Since I don't consider this another visit, its still part of the first visit.
When the ip_address is the same, check timestamp and only count those rows that are 10 seconds away from each other.
I am having difficulty in putting this into a SQL query form, I would appreciate any help on this!
Let me start with this table. I'll use ordinary timestamps so we can easily see what's going on.
180.2.79.3   2011-01-01 08:00:00
180.2.79.3   2011-01-01 08:00:09
180.2.79.3   2011-01-01 08:00:20
180.2.79.3   2011-01-01 08:00:23
180.2.79.3   2011-01-01 08:00:25
180.2.79.3   2011-01-01 08:00:40
180.2.79.4   2011-01-01 08:00:00
180.2.79.4   2011-01-01 08:00:13
180.2.79.4   2011-01-01 08:00:23
180.2.79.4   2011-01-01 08:00:25
180.2.79.4   2011-01-01 08:00:27
180.2.79.4   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:50
If I understand you correctly, you want to count these like this.
180.2.79.3   3
180.2.79.4   3
You can do that for each ip_address by selecting the maximum timestamp that is both
Taking these two criteria together will introduce some nulls, which turn out to be really useful.
select ip_address, 
       t_s.time_stamp, 
       (select max(t.time_stamp) 
        from t_s t 
        where t.ip_address = t_s.ip_address 
          and t.time_stamp > t_s.time_stamp
          and t.time_stamp - t_s.time_stamp <= interval '10' second) next_page
from t_s 
group by ip_address, t_s.time_stamp
order by ip_address, t_s.time_stamp;
ip_address   time_stamp            next_page
180.2.79.3   2011-01-01 08:00:00   2011-01-01 08:00:09
180.2.79.3   2011-01-01 08:00:09   <null>
180.2.79.3   2011-01-01 08:00:20   2011-01-01 08:00:25
180.2.79.3   2011-01-01 08:00:23   2011-01-01 08:00:25
180.2.79.3   2011-01-01 08:00:25   <null>
180.2.79.3   2011-01-01 08:00:40   <null>
180.2.79.4   2011-01-01 08:00:00   <null>
180.2.79.4   2011-01-01 08:00:13   2011-01-01 08:00:23
180.2.79.4   2011-01-01 08:00:23   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:25   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:27   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:29   <null>
180.2.79.4   2011-01-01 08:00:50   <null>
The timestamp that marks the end of a visit has a null for its own next_page. That's because no timestamp is less than or equal to time_stamp + 10 seconds for that row.
To get a count, I'd probably create a view and count the nulls.
select ip_address, count(*)
from t_s_visits 
where next_page is null
group by ip_address
180.2.79.3   3
180.2.79.4   3
You could JOIN the tracking table to itself and filter out the records you don't need by adding a WHEREclause.
SELECT  t1.ip_address
        , COUNT(*) AS tracks
FROM    tracking t1
        LEFT OUTER JOIN tracking t2 ON t2.ip_address = t1.ip_address
                                       AND t2.time_stamp < t1.time_stamp + 10
WHERE   t2.ip_adress IS NULL
GROUP BY
        t1.ip_address
Edit
Following script works in SQL Server but I can't express it in a single SQL statement, let alone convert it to MySQL. It might give you some pointers on what is needed though.
Note: I assume for given inputs, number 1 and 11 should get chosen.
;WITH q (number) AS (
  SELECT 1
  UNION ALL SELECT 2
  UNION ALL SELECT 10
  UNION ALL SELECT 11  
  UNION ALL SELECT 12
)
SELECT  q1.Number as n1
        , q2.Number as n2
        , 0 as Done
INTO    #Temp
FROM    q q1
        LEFT OUTER JOIN q q2 ON q2.number < q1.number + 10
                                AND q2.number > q1.number
DECLARE @n1 INTEGER
DECLARE @n2 INTEGER
WHILE EXISTS (SELECT * FROM #Temp WHERE Done = 0)
BEGIN
  SELECT  TOP 1 @n1 = n1
          , @n2= n2
  FROM    #Temp
  WHERE   Done = 0
  DELETE  FROM #Temp
  WHERE   n1 = @n2
  UPDATE  #Temp 
  SET     Done = 1
  WHERE   n1 = @n1 
          AND n2 = @n2         
END        
SELECT  DISTINCT n1 
FROM    #Temp
DROP TABLE #Temp
The simplest way to do this is to divide the timestamps by 10, and count the distinct combinations of those values and the ip_address values. That way each 10 second period is counted separately.
If you run this on your sample data it will give you 4 tracks, which is what you want I think.
Give it a try and see if it gives you the desired results on your full data set:
SELECT COUNT(DISTINCT ip_address, FLOOR(time_stamp/10)) AS tracks 
FROM tracking
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With