I am having problems optimizing a query in PostgreSQL 9.5.14.
select *
from file as f
join product_collection pc on (f.product_collection_id = pc.id)
where pc.mission_id = 7
order by f.id asc
limit 100;
Takes about 100 seconds. If I drop the limit clause it takes about 0.5:
With limit:
explain (analyze,buffers) ... -- query exactly as above
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.84..859.32 rows=100 width=457) (actual time=102793.422..102856.884 rows=100 loops=1)
   Buffers: shared hit=222430592
   ->  Nested Loop  (cost=0.84..58412343.43 rows=6804163 width=457) (actual time=102793.417..102856.872 rows=100 loops=1)
         Buffers: shared hit=222430592
         ->  Index Scan using file_pkey on file f  (cost=0.57..23409008.61 rows=113831736 width=330) (actual time=0.048..28207.152 rows=55858772 loops=1)
               Buffers: shared hit=55652672
         ->  Index Scan using product_collection_pkey on product_collection pc  (cost=0.28..0.30 rows=1 width=127) (actual time=0.001..0.001 rows=0 loops=55858772)
               Index Cond: (id = f.product_collection_id)
               Filter: (mission_id = 7)
               Rows Removed by Filter: 1
               Buffers: shared hit=166777920
 Planning time: 0.803 ms
 Execution time: 102856.988 ms
Without limit:
=>  explain (analyze,buffers)  ... -- query as above, just without limit
                                                                                 QUERY PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=20509671.01..20526681.42 rows=6804163 width=457) (actual time=456.175..510.596 rows=142055 loops=1)
   Sort Key: f.id
   Sort Method: quicksort  Memory: 79392kB
   Buffers: shared hit=37956
   ->  Nested Loop  (cost=0.84..16494851.02 rows=6804163 width=457) (actual time=0.044..231.051 rows=142055 loops=1)
         Buffers: shared hit=37956
         ->  Index Scan using product_collection_mission_id_index on product_collection pc  (cost=0.28..46.13 rows=87 width=127) (actual time=0.017..0.101 rows=87 loops=1)
               Index Cond: (mission_id = 7)
               Buffers: shared hit=10
         ->  Index Scan using file_product_collection_id_index on file f  (cost=0.57..187900.11 rows=169535 width=330) (actual time=0.007..1.335 rows=1633 loops=87)
               Index Cond: (product_collection_id = pc.id)
               Buffers: shared hit=37946
 Planning time: 0.807 ms
 Execution time: 569.865 ms
I have copied the database to a backup server so that I may safely manipulate the database without something else changing it on me.
Cardinalities:
Table file: 113,831,736 rows.
Table product_collection: 1370 rows.
The query without LIMIT: 142,055 rows.
SELECT count(*) FROM product_collection WHERE mission_id = 7: 87 rows.
What I have tried:
None of them seem to change the performance in a significant way.
Thoughts?
UPDATE from OP:
Tested this on PostgreSQL 9.6 & 10.4, and found no significant changes in plans or performance.
However, setting random_page_cost low enough is the only way to get faster performance on the without limit search.
With a default random_page_cost = 4, the without limit:
                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=9270013.01..9287875.64 rows=7145054 width=457) (actual time=47782.523..47843.812 rows=145697 loops=1)
   Sort Key: f.id
   Sort Method: external sort  Disk: 59416kB
   Buffers: shared hit=3997185 read=1295264, temp read=7427 written=7427
   ->  Hash Join  (cost=24.19..6966882.72 rows=7145054 width=457) (actual time=1.323..47458.767 rows=145697 loops=1)
         Hash Cond: (f.product_collection_id = pc.id)
         Buffers: shared hit=3997182 read=1295264
         ->  Seq Scan on file f  (cost=0.00..6458232.17 rows=116580217 width=330) (actual time=0.007..17097.581 rows=116729984 loops=1)
               Buffers: shared hit=3997169 read=1295261
         ->  Hash  (cost=23.08..23.08 rows=89 width=127) (actual time=0.840..0.840 rows=87 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 15kB
               Buffers: shared hit=13 read=3
               ->  Bitmap Heap Scan on product_collection pc  (cost=4.97..23.08 rows=89 width=127) (actual time=0.722..0.801 rows=87 loops=1)
                     Recheck Cond: (mission_id = 7)
                     Heap Blocks: exact=10
                     Buffers: shared hit=13 read=3
                     ->  Bitmap Index Scan on product_collection_mission_id_index  (cost=0.00..4.95 rows=89 width=0) (actual time=0.707..0.707 rows=87 loops=1)
                           Index Cond: (mission_id = 7)
                           Buffers: shared hit=3 read=3
 Planning time: 0.929 ms
 Execution time: 47911.689 ms
User Erwin's answer below will take me some time to fully understand and generalize to all of the use cases needed. In the mean time we will probably use either a materialized view or just flatten our table structure.
This query is harder for the Postgres query planner than it might look. Depending on cardinalities, data distribution, value frequencies, sizes, ... completely different query plans can prevail and the planner has a hard time predicting which is best. Current versions of Postgres are better at this in several aspects, but it's still hard to optimize.
Since you retrieve only relatively few rows from product_collection, this equivalent query with LIMIT in a LATERAL subquery should avoid performance degradation:
SELECT *
FROM   product_collection pc
CROSS  JOIN LATERAL (
   SELECT *
   FROM   file f  -- big table
   WHERE  f.product_collection_id = pc.id
   ORDER  BY f.id
   LIMIT  100
   ) f
WHERE  pc.mission_id = 7
ORDER  BY f.id
LIMIT  100;
Edit: This results in a query plan with explain (analyze,verbose) provided  by the OP:
                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=30524.34..30524.59 rows=100 width=457) (actual time=13.128..13.167 rows=100 loops=1)
   Buffers: shared hit=3213
   ->  Sort  (cost=30524.34..30546.09 rows=8700 width=457) (actual time=13.126..13.152 rows=100 loops=1)
         Sort Key: file.id
         Sort Method: top-N heapsort  Memory: 76kB
         Buffers: shared hit=3213
         ->  Nested Loop  (cost=0.57..30191.83 rows=8700 width=457) (actual time=0.060..9.868 rows=2880 loops=1)
               Buffers: shared hit=3213
               ->  Seq Scan on product_collection pc  (cost=0.00..69.12 rows=87 width=127) (actual time=0.024..0.336 rows=87 loops=1)
                     Filter: (mission_id = 7)
                     Rows Removed by Filter: 1283
                     Buffers: shared hit=13
               ->  Limit  (cost=0.57..344.24 rows=100 width=330) (actual time=0.008..0.071 rows=33 loops=87)
                     Buffers: shared hit=3200
                         ->  Index Scan using file_pc_id_index on file  (cost=0.57..582642.42 rows=169535 width=330) (actual time=0.007..0.065 rows=33 loops=87)
                           Index Cond: (product_collection_id = pc.id)
                           Buffers: shared hit=3200
 Planning time: 0.595 ms
 Execution time: 13.319 ms
You need these indexes (will help your original query, too):
CREATE INDEX idx1 ON file (product_collection_id, id);     -- crucial
CREATE INDEX idx2 ON product_collection (mission_id, id);  -- helpful
You mentioned:
two column indexes on
file.id&file.product_collection_id.
Etc. But we need it the other way round: id last. The order of index expressions is crucial. See:
Rationale: With only 87 rows from product_collection, we only fetch a maximum of 87 x 100 = 8700 rows (fewer if not every pc.id has 100 rows in table file), which are then sorted before picking the top 100. Performance degrades with the number of rows you get from product_collection and with bigger LIMIT.
With the multicolumn index idx1 above, that's 87 fast index scans. The rest is not very expensive.
More optimization is possible, depending on additional information. Related:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With