I have the following SQL statement, which works perfectly fine. I was hoping to see how it could be refactored so it doesn't require the use of RANK/PARTITION, if possible.
SELECT LogEntryId, FileId, CreatedOn, EventTypeId
FROM (SELECT a.LogEntryId, a.FileId, a.CreatedOn, a.EventTypeId,
             RANK() OVER (PARTITION BY ClientName ORDER BY a.CreatedOn DESC) AS MostRecentEventRank
      FROM LogEntries a
      WHERE (a.EventTypeId = 2 OR a.EventTypeId = 4)) SubQuery
WHERE MostRecentEventRank = 1
What is it trying to do?
In effect, this grabs the most recent event (a connection or disconnection) for each unique user in the table.
I do like RANK/PARTITION, but I was hoping to see if it's possible to do without using it.
Yet another variation: select the clients, then use CROSS APPLY (.. TOP (1) ... ORDER BY ...) to get the relevant entry.
SELECT c.ClientName, r.LogEntryId, r.FileId, r.CreatedOn, r.EventTypeId
FROM (
    SELECT DISTINCT ClientName
    FROM LogEntries
    WHERE EventTypeId IN (2, 4)) AS c
CROSS APPLY (
    SELECT TOP (1) a.LogEntryId, a.FileId, a.CreatedOn, a.EventTypeId
    FROM LogEntries AS a
    WHERE a.ClientName = c.ClientName
      AND a.EventTypeId IN (2, 4)
    ORDER BY a.CreatedOn DESC) AS r;
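If you want to avoid both the window function and APPLY, the same result can be approximated with a plain groupwise-max join. This is only a sketch: note that, like RANK (and unlike TOP (1)), it returns every tying row when a client has two qualifying events with an identical CreatedOn.

```sql
-- Sketch: groupwise max without RANK or CROSS APPLY.
-- The inner query finds each client's latest qualifying timestamp;
-- the join pulls back the matching row(s).
SELECT a.LogEntryId, a.FileId, a.CreatedOn, a.EventTypeId
FROM LogEntries AS a
JOIN (SELECT ClientName, MAX(CreatedOn) AS MaxCreatedOn
      FROM LogEntries
      WHERE EventTypeId IN (2, 4)
      GROUP BY ClientName) AS m
  ON m.ClientName = a.ClientName
 AND m.MaxCreatedOn = a.CreatedOn
WHERE a.EventTypeId IN (2, 4);
```

With the clustered index on (ClientName, CreatedOn) discussed below, the aggregate and the join both seek the same structure, so this tends to perform comparably to the APPLY form.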
Update
Talking about performance of a T-SQL query without knowing the schema is nonsense. This query is close to optimal on a schema properly designed for its needs. Since the access is by ClientName and CreatedOn, even a simplistic schema needs to take this into consideration:
CREATE TABLE LogEntries (
LogEntryId int identity(1,1),
FileID int,
CreatedOn datetime,
EventTypeID int,
ClientName varchar(30)
);
create clustered index cdxLogEntries on LogEntries (
ClientName, CreatedOn DESC);
go
And let's load the table with some 2.4M rows:
declare @i int;
set @i = 0;
while @i < 1000
begin
insert into LogEntries (FileId, CreatedOn, EventTypeId, ClientName)
select cast(rand()*100 as int),
dateadd(minute, -rand()*10000, getdate()),
cast(rand() * 5 as int),
'Client' + cast(@i as varchar(10))
from master..spt_values;
set @i = @i+1;
end
What time and IO do we get with set statistics io on; set statistics time on; on a warmed cache?
(410 row(s) affected)
Table 'LogEntries'. Scan count 411, logical reads 14354, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1219 ms, elapsed time = 1932 ms.
1.9 sec to get the data from 2.4M entries on my laptop (which is 4 years old and has 1 GB of RAM). And there is still plenty of room for improvement in the schema design. Separating ClientName into a normalized table, with a trusted foreign key from LogEntries into it, would reduce the time significantly. A proper filtered index on EventTypeId IN (2,4) would also contribute. We have not even started exploring parallelism possibilities.
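As one illustration of the filtered-index idea, something along these lines would index only the rows the query actually touches (a sketch; the index name and INCLUDE list are my assumptions, not part of the tested setup above):

```sql
-- Hypothetical filtered index: only rows with EventTypeId 2 or 4 are kept
-- in the index, so each per-client seek scans a much smaller structure.
-- INCLUDE makes the index covering for the columns the query selects.
CREATE NONCLUSTERED INDEX ixLogEntries_Events24
ON LogEntries (ClientName, CreatedOn DESC)
INCLUDE (LogEntryId, FileId, EventTypeId)
WHERE EventTypeId IN (2, 4);
```

With this in place the EventTypeId IN (2,4) predicate is satisfied by the index definition itself rather than by filtering rows at read time.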
This is SQL, performance is obtained on the drawing board of your schema, not in the text editor of your query.