I am trying to improve the performance of a SQL query on a Postgres 9.4 database. I managed to re-write the query so that it would use indexes and it is superfast now! But I do not quite understand why.
This is the original query:
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE
op.alternate_localized_names ILIKE unaccent('%query%') OR
lower(sldt.unaccent_title) LIKE unaccent(lower('%query%')) OR
lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'))
ORDER BY dt.updated_at DESC
LIMIT 100;
I have placed 3 trigram indexes using pg_trgm on op.alternate_localized_names
, lower(sldt.unaccent_title)
and lower(sldt.unaccent_description)
.
But, Postgres is not using them, instead it performs a SeqScan on the full tables to join them as shown by EXPLAIN
:
Limit
-> Unique
-> Sort
Sort Key: dt.updated_at, dt.id
-> Hash Join
Hash Cond: (sldt.day_id = dt.id)
Join Filter: ((op.alternate_localized_names ~~* unaccent('%query%'::text)) OR (lower(sldt.unaccent_title) ~~ unaccent('%query%'::text)) OR (lower(sldt.unaccent_description) ~~ unaccent('%query%'::text)))
-> Seq Scan on optimized_localized_day sldt
-> Hash
-> Hash Join
Hash Cond: (dtp.geoname_id = op.geoname_id)
-> Hash Join
Hash Cond: (dtp.day_template_id = dt.id)
-> Seq Scan on day_template_place dtp
-> Hash
-> Seq Scan on day dt
-> Hash
-> Seq Scan on optimized_place op
However, when I split the query in 2, one to search on public.optimized_localized_day
and one on public.optimized_place
, it now uses their indexes:
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE op.alternate_localized_names ILIKE unaccent('%query%')
UNION
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
WHERE lower(sldt.unaccent_title) LIKE unaccent(lower('%query%'))
OR lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'));
And the EXPLAIN
:
HashAggregate
-> Append
-> HashAggregate
-> Nested Loop
-> Nested Loop
-> Bitmap Heap Scan on optimized_place op
Recheck Cond: (alternate_localized_names ~~* unaccent('%query%'::text))
-> Bitmap Index Scan on idx_trgm_place_lower
Index Cond: (alternate_localized_names ~~* unaccent('%jericho%'::text))
-> Bitmap Heap Scan on day_template_place dtp
Recheck Cond: (geoname_id = op.geoname_id)
-> Bitmap Index Scan on day_template_place_geoname_idx
Index Cond: (geoname_id = op.geoname_id)
-> Index Scan using day_pkey on day dt
Index Cond: (id = dtp.day_template_id)
-> HashAggregate
-> Nested Loop
-> Bitmap Heap Scan on optimized_localized_day sldt
Recheck Cond: ((lower(unaccent_title) ~~ unaccent('%query%'::text)) OR (lower(unaccent_description) ~~ unaccent('%query%'::text)))
-> BitmapOr
-> Bitmap Index Scan on tgrm_idx_localized_day_title
Index Cond: (lower(unaccent_title) ~~ unaccent('%query%'::text))
-> Bitmap Index Scan on tgrm_idx_localized_day_description
Index Cond: (lower(unaccent_description) ~~ unaccent('%query%'::text))
-> Index Scan using day_pkey on day dt_1
Index Cond: (id = sldt.day_id)
From what I understand, having conditions on 2 separate tables in an OR
clause causes Postgres to join the tables first and then filter them. But I am not sure about this. Second thing that puzzles me, I would like to understand how Postgres manages the filtering in the second query.
Do you guys know how Postgres handles those 2 cases ? Thanks :)
The transformation of the original query to the UNION
cannot be made automatically.
Consider a simplified case:
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE x.a = 0 OR x.b = 0;
Imagine it has three result rows:
a | b
---+---
0 | 0
1 | 0
1 | 0
If you replace this with
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE x.a = 0
UNION
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE y.b = 0;
the result will only have two rows, because UNION
removes duplicates.
If you use UNION ALL
instead, the result will have four rows, because the row with the two zeros will appear twice, once from each branch of the query.
So this transformation cannot always be made safely. In your case, you can get away with it, because you remove duplicates anyway.
By the way: if you use UNION
, you don't need the DISTINCT
any more, because duplicates will be removed anyway. Your query will become cheaper if you remove the DISTINCT
s.
In the second branch of your second query, PostgreSQL can handle the OR
with index scans because the conditions are on the same table. In that case, PostgreSQL can perform a bitmap index scan:
The index is scanned, and PostgreSQL build a bitmap in memory that contains 1 for each table row where the index scan results in a match and 0 otherwise.
This bitmap is ordered in the physical order of the table rows.
The same thing happens for the other condition with the other index.
The resulting bitmaps are joined with a bit-wise OR
operation.
The resulting bitmap is used to fetch the matching rows from the table.
A trigram index is only a filter that can have false positive results, so the original condition has to be re-checked during that table scan.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With