explain analyze shows that Postgres will use an index scan for my query that fetches rows and filters by date (i.e., 2017-04-14 05:27:51.039):
explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using updated on tbl t (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
Planning time: 0.102 ms
Execution time: 0.720 ms
However, running the same query with a different date filter, '2016-04-14 05:27:51.039', shows that Postgres will run it with a seq scan instead:
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl t (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
Rows Removed by Filter: 947
Planning time: 0.100 ms
Execution time: 2910.086 ms
How does Postgres decide which plan to use, specifically when filtering by date?
The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
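If you want to check or refresh those statistics yourself, something like this works (pg_stat_user_tables is a standard statistics view; tbl is just the table name from the question):

ANALYZE tbl;

SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'tbl';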
The manual:
Most queries retrieve only a fraction of the rows in a table, due to WHERE clauses that restrict the rows to be examined. The planner thus needs to make an estimate of the selectivity of WHERE clauses, that is, the fraction of rows that match each condition in the WHERE clause. The information used for this task is stored in the pg_statistic system catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated.
There is a row count (in pg_class), a list of most common values, etc.
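You can inspect those numbers directly; reltuples / relpages in pg_class are the planner's cached row and page counts, and pg_stats exposes the per-column statistics in readable form:

SELECT reltuples, relpages
FROM   pg_class
WHERE  relname = 'tbl';

SELECT null_frac, n_distinct, most_common_vals
FROM   pg_stats
WHERE  tablename = 'tbl'
AND    attname = 'updated';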
The more rows Postgres expects to find, the more likely it is to switch to a sequential scan, which is cheaper for retrieving large portions of a table. Generally, it goes index scan -> bitmap index scan -> sequential scan as the expected number of rows grows.
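You can watch that switch by making sequential scans unattractive in a test session. enable_seqscan does not forbid seq scans, it just makes them look very expensive to the planner, so this is for experimenting only:

SET enable_seqscan = off;
EXPLAIN SELECT * FROM tbl t WHERE updated > '2016-04-14 05:27:51.039';
RESET enable_seqscan;

Comparing the estimated cost of the forced index (or bitmap) scan with the seq scan cost from your original plan shows why the planner chose the way it did.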
For your particular example, the important statistic is histogram_bounds, which gives Postgres a rough idea how many rows have a greater value than the given one. There is the more convenient view pg_stats for the human eye:
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';
There is a dedicated chapter explaining row estimation in the manual.
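If the row estimate is badly off, a related knob is the per-column statistics target, which controls how many histogram buckets and most-common values ANALYZE gathers for the column (the default is 100; 500 below is just an arbitrary example):

ALTER TABLE tbl ALTER COLUMN updated SET STATISTICS 500;
ANALYZE tbl;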
Obviously, optimization of queries is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer. Instead, it is intended to give you some background on how the decision to use an index is made.
Your first query is estimated to return 10,418 rows. When using an index, Postgres first reads the index to find the matching entries and then fetches each matching row from its data page. In other words, there is a little bit of overhead when using the index: initializing the index and then looking up each data page individually.
When the engine does a full table scan, it simply reads the data pages in order, applying the filter to each row as it goes. There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.
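To see that difference in page accesses for yourself, add the BUFFERS option, which appends a line per plan node showing how many shared buffers were hit and read:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl t
WHERE  updated > '2017-04-14 05:27:51.039';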
The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
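Where exactly that crossover lands depends on the planner's cost settings. On SSDs or with a mostly cached table, lowering random_page_cost (default 4.0) makes random index fetches look cheaper and moves the crossover toward index scans. The value 1.1 below is only an illustration, not a tuning recommendation:

SHOW random_page_cost;

SET random_page_cost = 1.1;  -- affects this session only
EXPLAIN ANALYZE SELECT * FROM tbl t WHERE updated > '2016-04-14 05:27:51.039';
RESET random_page_cost;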