I have two queries against a PostGIS database that differ only in a LIMIT clause, yet one takes far longer to finish.
table1 is a TimescaleDB hypertable containing ~7 million rows, while table2 contains only 10 rows, so both queries should return the same result. I have included the EXPLAIN ANALYZE output for both queries below.
The first takes over nine seconds:
SELECT
    COUNT(DISTINCT d.id)
FROM
    (SELECT * FROM table1 LIMIT 50000) d
JOIN
    (SELECT * FROM table2) r
ON
    ST_Within(d.geom, r.geom)
;
Aggregate  (cost=634043.20..634043.21 rows=1 width=8) (actual time=9482.297..9482.300 rows=1 loops=1)
  ->  Sort  (cost=634040.70..634041.95 rows=500 width=33) (actual time=9482.038..9482.144 rows=2553 loops=1)
        Sort Key: _hyper_4_2_chunk.id
        Sort Method: quicksort  Memory: 97kB
        ->  Nested Loop  (cost=0.21..634018.28 rows=500 width=33) (actual time=12.781..9476.510 rows=2553 loops=1)
              ->  Limit  (cost=0.08..503.98 rows=50000 width=73) (actual time=0.016..20.992 rows=50000 loops=1)
                    ->  Result  (cost=0.08..1923180.28 rows=190828000 width=73) (actual time=0.015..16.093 rows=50000 loops=1)
                          ->  Custom Scan (DecompressChunk) on _hyper_4_2_chunk  (cost=0.08..14900.28 rows=190828000 width=65) (actual time=0.014..9.731 rows=50000 loops=1)
                                ->  Seq Scan on compress_hyper_5_3_chunk  (cost=0.00..14900.28 rows=190828 width=69) (actual time=0.008..0.376 rows=1280 loops=1)
              ->  Index Scan using table2_geom_idx1 on table2  (cost=0.13..12.65 rows=1 width=2262228) (actual time=0.186..0.189 rows=0 loops=50000)
                    Index Cond: (geom ~ _hyper_4_2_chunk.geom)
                    Filter: st_within(_hyper_4_2_chunk.geom, geom)
                    Rows Removed by Filter: 0
Planning Time: 0.233 ms
Execution Time: 9482.367 ms
The second finishes in about 240 milliseconds:
SELECT
    COUNT(DISTINCT d.id)
FROM
    (SELECT * FROM table1 LIMIT 50000) d
JOIN
    (SELECT * FROM table2 LIMIT 10) r
ON
    ST_Within(d.geom, r.geom)
;
Aggregate  (cost=6263276.10..6263276.11 rows=1 width=8) (actual time=240.348..240.352 rows=1 loops=1)
  ->  Sort  (cost=6263273.60..6263274.85 rows=500 width=33) (actual time=240.164..240.242 rows=2553 loops=1)
        Sort Key: d.id
        Sort Method: quicksort  Memory: 97kB
        ->  Nested Loop  (cost=0.08..6263251.18 rows=500 width=33) (actual time=24.089..239.821 rows=2553 loops=1)
              Join Filter: st_within(d.geom, table2.geom)
              Rows Removed by Join Filter: 497447
              ->  Limit  (cost=0.00..12.10 rows=10 width=2263078) (actual time=0.052..0.091 rows=10 loops=1)
                    ->  Seq Scan on table2  (cost=0.00..12.10 rows=10 width=2263078) (actual time=0.051..0.081 rows=10 loops=1)
              ->  Materialize  (cost=0.08..1839.98 rows=50000 width=65) (actual time=0.004..4.528 rows=50000 loops=10)
                    ->  Subquery Scan on d  (cost=0.08..1003.98 rows=50000 width=65) (actual time=0.013..10.659 rows=50000 loops=1)
                          ->  Limit  (cost=0.08..503.98 rows=50000 width=73) (actual time=0.012..8.109 rows=50000 loops=1)
                                ->  Result  (cost=0.08..1923180.28 rows=190828000 width=73) (actual time=0.012..5.850 rows=50000 loops=1)
                                      ->  Custom Scan (DecompressChunk) on _hyper_4_2_chunk  (cost=0.08..14900.28 rows=190828000 width=65) (actual time=0.011..3.284 rows=50000 loops=1)
                                            ->  Seq Scan on compress_hyper_5_3_chunk  (cost=0.00..14900.28 rows=190828 width=69) (actual time=0.005..0.155 rows=1280 loops=1)
Planning Time: 0.280 ms
Execution Time: 240.755 ms
I added indexes to both tables before running the queries above:
CREATE INDEX ON table1 (id, timestamp DESC);
CREATE INDEX ON table1 USING GIST (geom);
CREATE INDEX ON table2 USING GIST (geom);
I would like to understand why the first query is planned so differently from the second. Software versions:
PostgreSQL 17.4 (Ubuntu 17.4-1.pgdg24.04+2) on aarch64-unknown-linux-gnu
PostGIS 3.5
TimescaleDB 2.18.2
EDIT
Removing the table2 indexes as suggested fixed the slowdown caused by the subqueries involving table2, but now I am facing a similar issue with the subquery referencing table1:
FROM
    (SELECT * FROM table1 LIMIT 50000) d
With the LIMIT set to 7,000,000 (more than the total number of rows in table1), the query finishes, but removing the LIMIT clause entirely results in a query that never finishes. Dropping the table1 indexes and running ANALYZE has not fixed this.
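For reference, here is the full query with the LIMIT removed, i.e. the variant that never finishes:
SELECT
    COUNT(DISTINCT d.id)
FROM
    (SELECT * FROM table1) d
JOIN
    (SELECT * FROM table2) r
ON
    ST_Within(d.geom, r.geom)
;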
To reduce this type of planner issue (choosing an inappropriate index scan), always configure server resource parameters such as shared_buffers, effective_cache_size, effective_io_concurrency, and work_mem to reflect your server hardware, regardless of your SQL queries. The Postgres defaults are tiny. One popular auto-configuration tool is PGTune.
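For example, on a hypothetical dedicated server with 8 GB of RAM and SSD storage, settings in this ballpark would be a reasonable starting point (the values below are illustrative, not tuned for your machine; derive your own with PGTune):
-- Illustrative values for an assumed 8 GB RAM / SSD server
ALTER SYSTEM SET shared_buffers = '2GB';          -- ~25% of RAM; requires a restart
ALTER SYSTEM SET effective_cache_size = '6GB';    -- ~75% of RAM; a planner hint, not an allocation
ALTER SYSTEM SET effective_io_concurrency = 200;  -- suitable for SSDs
ALTER SYSTEM SET work_mem = '64MB';               -- per sort/hash node, so keep it modest
SELECT pg_reload_conf();                          -- applies everything except shared_buffers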
If your config is fine and queries are still slow, run a database-wide VACUUM ANALYZE, or at least make sure you ran ANALYZE after substantial changes in row count (e.g. removing 75% of rows or adding 500%).
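Concretely:
VACUUM ANALYZE;   -- whole database
-- or, after bulk changes to specific tables:
ANALYZE table1;
ANALYZE table2;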
Only after that should you go deeper and run EXPLAIN ANALYZE, as you did here (which is great!). Query tuning is another, longer story.
In your case, the index use was indeed sub-optimal, but it quite likely happened because of the Postgres config defaults.
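One way to check whether the index scan really is the culprit, independently of any config changes, is to disable index scans for a single session and compare plans. This is a diagnostic sketch only, not a setting to leave enabled:
SET enable_indexscan = off;
EXPLAIN ANALYZE
SELECT COUNT(DISTINCT d.id)
FROM (SELECT * FROM table1 LIMIT 50000) d
JOIN (SELECT * FROM table2) r
ON ST_Within(d.geom, r.geom);
RESET enable_indexscan;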
(If you can repeat the same "experiment" with an updated Postgres config, please do!)