 

How to efficiently select rows having a MIN date in postgres

I need to quickly select a value ( baz ) from the "earliest" ( MIN(save_date) ) row for each foo_id. The following query returns the correct rows (well, almost: it can return multiple rows per foo_id when there are duplicate save_dates).

The foos table contains about 55k rows and the samples table contains about 25 million rows.

CREATE TABLE foos (
    foo_id     int,
    val        varchar(40),
    -- ref_id is a FK, constraint omitted for brevity
    ref_id     int
);

CREATE TABLE samples (
    sample_id  int,
    save_date  date,
    baz        smallint,
    -- foo_id is a FK, constraint omitted for brevity
    foo_id     int
);

WITH foo ( foo_id, val ) AS (
        SELECT foo_id, val FROM foos
        WHERE foos.ref_id = 1
    ORDER BY foos.val ASC
    LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
    SELECT A.baz, A.foo_id
    FROM samples A
    INNER JOIN (
        SELECT foo_id, MIN( save_date ) AS save_date
        FROM samples
        GROUP BY foo_id
    ) B
    USING ( foo_id, save_date )
) firsts USING ( foo_id );

This query currently takes over 100 seconds; I'd like to see this return in ~1 second (or less!).

How can I write this query to be optimal?
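(Aside: a common Postgres idiom for this kind of "earliest row per group" problem is DISTINCT ON; a minimal sketch against the schema above, shown only for comparison:

SELECT DISTINCT ON (foo_id) foo_id, baz
FROM samples
ORDER BY foo_id, save_date;

This returns exactly one row per foo_id, the one with the smallest save_date, with ties broken arbitrarily unless more ORDER BY columns are added.)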


Update: adding EXPLAIN output.

Obviously the actual query I'm using doesn't use tables and columns literally named foo, baz, etc.

The "dumbed down" example query's EXPLAIN (from above):

Hash Right Join  (cost=337.69..635.47 rows=3 width=100)
  Hash Cond: (a.foo_id = foo.foo_id)
  CTE foo
    ->  Limit  (cost=71.52..71.53 rows=3 width=102)
          ->  Sort  (cost=71.52..71.53 rows=3 width=102)
                Sort Key: foos.val
                ->  Seq Scan on foos  (cost=0.00..71.50 rows=3 width=102)
                      Filter: (ref_id = 1)
  ->  Hash Join  (cost=265.25..562.90 rows=9 width=6)
        Hash Cond: ((a.foo_id = samples.foo_id) AND (a.save_date = (min(samples.save_date))))
        ->  Seq Scan on samples a  (cost=0.00..195.00 rows=1850 width=10)
        ->  Hash  (cost=244.25..244.25 rows=200 width=8)
              ->  HashAggregate  (cost=204.25..224.25 rows=200 width=8)
                    ->  Seq Scan on samples  (cost=0.00..195.00 rows=1850 width=8)
  ->  Hash  (cost=0.60..0.60 rows=3 width=102)
        ->  CTE Scan on foo  (cost=0.00..0.60 rows=3 width=102)
asked by David Murdoch


1 Answer

If I understand the question, you want a window function.

WITH find_first AS (
  SELECT foo_id, baz,
    -- number each sample within its foo_id, earliest save_date first
    row_number() OVER (PARTITION BY foo_id ORDER BY save_date) AS rnum
  FROM samples
)
SELECT foo_id, baz FROM find_first WHERE rnum = 1;

Using row_number instead of rank eliminates duplicates and guarantees only one baz per foo_id. If you also need the foos that have no samples (and hence no baz), just LEFT JOIN the foos table to this query.
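A minimal sketch of that LEFT JOIN, reusing the foo CTE from the question (the ref_id = 1 filter and LIMIT 25 come from the question, not from this answer), so foos with no samples still show up with a NULL baz:

WITH foo AS (
    SELECT foo_id, val
    FROM foos
    WHERE ref_id = 1
    ORDER BY val
    LIMIT 25 OFFSET 0
), find_first AS (
    SELECT foo_id, baz,
        row_number() OVER (PARTITION BY foo_id ORDER BY save_date) AS rnum
    FROM samples
)
SELECT foo.val, find_first.baz
FROM foo
LEFT JOIN find_first
       ON find_first.foo_id = foo.foo_id
      AND find_first.rnum = 1;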

With an index on (foo_id, save_date), the optimizer should be smart enough to do the grouping, keep only one baz per foo_id, and skip merrily along.
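A sketch of that index (the index name here is just an illustration):

CREATE INDEX samples_foo_id_save_date_idx
    ON samples (foo_id, save_date);

On Postgres 11 or newer, adding INCLUDE (baz) would let the planner consider an index-only scan for this query; whether that actually helps on this data is an assumption, not something measured here.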

answered by Andrew Lazarus


