 

Optimize BETWEEN date statement

I need help optimizing a Postgres query which uses the BETWEEN clause with a timestamp field.

I have 2 tables:

ONE(int id_one(PK), datetime cut_time, int f1 ...) 

containing about 3394 rows

TWO(int id_two(PK), int id_one(FK), int f2 ...) 

containing about 4000000 rows

There are btree indexes on both PKs (id_one and id_two), as well as on the FK column two.id_one and on cut_time.

I want to perform a query like:

select o.id_one, Date(o.cut_time), o.f1, t.f2 
from one o
inner join two t ON (o.id_one = t.id_one)
where o.cut_time between '2013-01-01' and '2013-01-31';

This query retrieves about 1,700,000 rows in about 7 seconds.

The EXPLAIN ANALYZE output is reported below:

Merge Join  (cost=20000000003.53..20000197562.38 rows=1680916 width=24) (actual time=0.017..741.718 rows=1692345 loops=1)
  Merge Cond: (c.coilid = hf.coilid)
  ->  Index Scan using pk_coils on coils c  (cost=10000000000.00..10000000382.13 rows=1420 width=16) (actual time=0.008..4.539 rows=1404 loops=1)
        Filter: ((cut_time >= '2013-01-01 00:00:00'::timestamp without time zone) AND (cut_time <= '2013-01-31 00:00:00'::timestamp without time zone))
        Rows Removed by Filter: 1990
  ->  Index Scan using idx_fk_lf_data on hf_data hf  (cost=10000000000.00..10000166145.90 rows=4017625 width=16) (actual time=0.003..392.535 rows=1963386 loops=1)
Total runtime: 768.473 ms

The index on the timestamp column isn't used. How can I optimize this query?

asked Oct 25 '25 by Nko


1 Answer

Proper DDL script

A proper setup could look like this:

db<>fiddle here
Old sqlfiddle

More about this fiddle further down.
Assuming data type timestamp for the column cut_time (declared as datetime in the question).
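
A minimal sketch of such a setup, using the names from the queries below (one_id instead of the question's id_one); the exact types, constraints, and index names are assumptions:

-- parent table with the timestamp to filter on
CREATE TABLE one (
   one_id   serial PRIMARY KEY
 , cut_time timestamp NOT NULL
 , f1       int
);

-- big child table referencing one
CREATE TABLE two (
   two_id   serial PRIMARY KEY
 , one_id   int NOT NULL REFERENCES one (one_id)
 , f2       int
);

-- PK indexes are created automatically;
-- these support the range filter and the join
CREATE INDEX one_cut_time_idx ON one (cut_time);
CREATE INDEX two_one_id_idx   ON two (one_id);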

Incorrect query

BETWEEN is almost always wrong on principle with timestamp columns. See:

  • Find overlapping date ranges in PostgreSQL

In your query:

SELECT o.one_id, date(o.cut_time), o.f1, t.f2 
FROM   one o
JOIN   two t USING (one_id)
WHERE  o.cut_time BETWEEN '2013-01-01' AND '2013-01-31';

... the string constants '2013-01-01' and '2013-01-31' are coerced to the timestamps '2013-01-01 00:00' and '2013-01-31 00:00'. This excludes most of Jan. 31. The timestamp '2013-01-31 12:00' would not qualify, which is most certainly wrong.
If you used '2013-02-01' as the upper bound instead, it would include '2013-02-01 00:00'. Still wrong.

To get all timestamps in January 2013, it needs to be:

SELECT o.one_id, date(o.cut_time), o.f1, t.f2 
FROM   one o
JOIN   two t USING (one_id)
WHERE  o.cut_time >= '2013-01-01'
AND    o.cut_time <  '2013-02-01';

Exclude the upper bound.

Optimize query

It's probably pointless to retrieve 1.7 million rows. Aggregate before you retrieve the result.
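
A hypothetical example, assuming you only need a count of matching rows in two per row in one (adapt the aggregate to your actual requirement):

SELECT o.one_id, date(o.cut_time) AS cut_date, o.f1, count(*) AS two_rows
FROM   one o
JOIN   two t USING (one_id)
WHERE  o.cut_time >= '2013-01-01'
AND    o.cut_time <  '2013-02-01'
GROUP  BY o.one_id, date(o.cut_time), o.f1;  -- one group per qualifying row in one

This returns at most one row per qualifying row in one (roughly 1,400, going by the row counts in the plan) instead of 1.7 million.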

Since table two is so much bigger, it's crucial how many rows you get from there. When retrieving more than ~ 5 %, a plain index on two.one_id will typically not be used, because it is faster to scan the table sequentially right away.

Your table statistics are outdated, or you have messed with cost constants and other parameters (which you obviously have, see below) to force Postgres into using the index anyway.

The only chance I would see for an index on two is a covering index:

CREATE INDEX two_one_id_f2 ON two(one_id, f2);

This way, Postgres can read from the index directly with an index-only scan, if some preconditions are met. Might be a bit faster, not much. Didn't test.
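
On Postgres 11 or later, the same idea can be written as a true covering index, keeping f2 out of the search key as mere payload (the index name is made up):

-- f2 is stored in the index leaf pages but is not part of the key
CREATE INDEX two_one_id_incl_f2 ON two (one_id) INCLUDE (f2);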

Strange numbers in EXPLAIN output

As for the strange numbers in your EXPLAIN ANALYZE output: the fiddle should explain them.

Seems like you had these debug settings:

SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_bitmapscan = off;

All of them should be on (the default setting), except for debugging. Else it cripples performance! Check with:

SELECT * FROM pg_settings WHERE name ~~ 'enable%';
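
To switch the three planner settings back to their defaults for the current session (a plain RESET, nothing specific to this setup):

RESET enable_seqscan;
RESET enable_indexscan;
RESET enable_bitmapscan;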
answered Oct 27 '25 by Erwin Brandstetter