So, i have a table with roughly 1.5 million rows in it, looking somewhat like this:
name | time | data1 | data2
--------------------------------------
93-15 | 1337348782 | 11 | 60.791
92-02 | 1337348783 | 11 | 62.584
92-02 | 1337348056 | 11 | 63.281
93-15 | 1337348068 | 8 | 65.849
92-02 | 1337348117 | 11 | 63.271
93-15 | 1337348129 | 8 | 65.849
92-02 | 1337348176 | 10 | 63.258
93-15 | 1337348188 | 8 | 65.849
92-02 | 1337348238 | 10 | 63.245
93-15 | 1337348248 | 8 | 65.849
...these correspond to historical status updates from something that needs to be monitored. Now, what i would like to do is to find the current status if each unit.
It wasn't hard finding similar questions here on stackoverflow, and extrapolating from the findingsat, i came up with this query:
SELECT * FROM vehicles v
JOIN ( SELECT MAX(time) as max, name
FROM vehicles
GROUP BY name)
m_v
ON (v.time = m_v.max AND v.name = m_v.name);
but seeing as i have roughly 1.5 million rows (and counting), is there a different approach that allows for a faster query?
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC) AS sequence_id,
*
FROM
vehicles
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
A covering index on (name, time) would be helpful too.
EDIT: Notes on how it work, etc.
PostgreSQL has what are know as windowing or analytical functions. These generally take the form some_function() OVER (PARTITION BY some_fields ORDER BY some_fields).
In this case I used ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC).
ROW_NUMBER() creates a unique row number for a set of data. 1 to n for n records.
PARTITION BY name means that this function is applied independently to different names. Each name is it's own group/window/partition, and the results of ROW_NUMBER() start over from 1 again for each group/window/partition.
ORDER BY time DESC takes all the records in one group/window/partition and orders them by the time field, with the highest value first, before the ROW_NUMBER() function is applied.
For your example data, therefore, you get this...
name | time | data1 | data2 | row_number
--------------------------------------------------
92-02 | 1337348783 | 11 | 62.584 | 1
92-02 | 1337348238 | 10 | 63.245 | 2
92-02 | 1337348176 | 10 | 63.258 | 3
92-02 | 1337348117 | 11 | 63.271 | 4
92-02 | 1337348056 | 11 | 63.281 | 5
93-15 | 1337348782 | 11 | 60.791 | 1
93-15 | 1337348248 | 8 | 65.849 | 2
93-15 | 1337348188 | 8 | 65.849 | 3
93-15 | 1337348129 | 8 | 65.849 | 4
93-15 | 1337348068 | 8 | 65.849 | 5
Because the ordering is time DESC, the highest valued time field, in each name group/window/partition, will always have a row_number of 1.
Having an index on (name, time) makes it much easier for the optimiser by ensuring the data is in a friendly order. This means that ROW_NUMBER() isn't actually applied to all the records; as soon as it finds the highest valued time record, and assigns ROW_NUMBER() = 1, it knows it can stop and move on to the next name.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With