Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow regex query on 80M record in PostgreSQL

I have a read-only table with 80 million rows :

   Column    |          Type          | Modifiers | Storage  | Stats target | Description 
-------------+------------------------+-----------+----------+--------------+-------------
 id          | character(11)          | not null  | extended |              | 
 gender      | character(1)           |           | extended |              | 
 postal_code | character varying(10)  |           | extended |              | 
 operator    | character varying(5)   |           | extended |              | 

Indexes:
    "categorised_phones_pkey" PRIMARY KEY, btree (id)
    "operator_idx" btree (operator)
    "postal_code_trgm_idx" gin (postal_code gin_trgm_ops)

id is Primary Key and contains unique mobile numbers. Table rows looks like this:

      id        |     gender   |   postal_code  |   operator
----------------+--------------+----------------+------------
 09567849087    |      m       |   7414776788   |     mtn
 09565649846    |      f       |   1268398732   |     mci
 09568831245    |      f       |   7412556443   |     mtn
 09469774390    |      m       |   5488312790   |     mci

This query takes almost ~65 seconds for the first time and ~8 seconds for next times:

select operator,count(*) from categorised_phones where postal_code like '1%' group by operator;

And the output looks like this:

operator |  count  
----------+---------
 mci      | 4050314
 mtn      | 6235778

And the output of explain alanyze :

HashAggregate  (cost=1364980.61..1364980.63 rows=2 width=10) (actual time=8257.026..8257.026 rows=2 loops=1)
   Group Key: operator
   ->  Bitmap Heap Scan on categorised_phones  (cost=95969.17..1312915.34 rows=10413054 width=2) (actual time=1140.803..6332.534 rows=10286092 loops=1)
         Recheck Cond: ((postal_code)::text ~~ '1%'::text)
         Rows Removed by Index Recheck: 25105697
         Heap Blocks: exact=50449 lossy=237243
         ->  Bitmap Index Scan on postal_code_trgm_idx  (cost=0.00..93365.90 rows=10413054 width=0) (actual time=1129.270..1129.270 rows=10287127 loops=1)
               Index Cond: ((postal_code)::text ~~ '1%'::text)
 Planning time: 0.540 ms
 Execution time: 8257.392 ms

How can I make this query faster?

Any idea would be great appreciated.

P.S:

I'm using PostgreSQL 9.6.1

UPDATE

I just updated the question. I disabled Parallel Query and results changed.

like image 699
Hamed Kamrava Avatar asked Sep 14 '25 12:09

Hamed Kamrava


1 Answers

For queries that involve comparisons of the form LIKE '%start', and following PostgreSQL own advice, you can use the following index:

CREATE INDEX postal_code_idx  ON categorised_phones (postal_code varchar_pattern_ops) ;

With that index in place, and some simulated data, your execution plan could very likely look like:

| QUERY PLAN                                                                                                                             |
| :------------------------------------------------------------------------------------------------------------------------------------- |
| HashAggregate  (cost=2368.65..2368.67 rows=2 width=12) (actual time=18.093..18.094 rows=2 loops=1)                                     |
|   Group Key: operator                                                                                                                  |
|   ->  Bitmap Heap Scan on categorised_phones  (cost=536.79..2265.83 rows=20564 width=4) (actual time=2.564..12.061 rows=22171 loops=1) |
|         Filter: ((postal_code)::text ~~ '1%'::text)                                                                                    |
|         Heap Blocks: exact=1455                                                                                                        |
|         ->  Bitmap Index Scan on postal_code_idx  (cost=0.00..531.65 rows=21923 width=0) (actual time=2.386..2.386 rows=22171 loops=1) |
|               Index Cond: (((postal_code)::text ~>=~ '1'::text) AND ((postal_code)::text ~<~ '2'::text))                               |
| Planning time: 0.119 ms                                                                                                                |
| Execution time: 18.122 ms                                                                                                              |

You can check it at dbfiddle here

If you have both queries with LIKE 'start%' and LIKE '%middle%', you should add this index, but keep the one already in place. Trigram indexes might prove useful with this second kind of match.


Why?

From PostgreSQL documentation on operator classes:

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.

From PostgreSQL documentation on Index Types

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see Section 11.9 below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.


UPDATE

If the queries performed involved always a fix (and relatively small) number of LIKE 'x%' expressions, consider using partial indexes.

For instance, for LIKE '1%', you'd have the following index, and the following query plan (it shows about a 3x improvement):

CREATE INDEX idx_1 ON categorised_phones (operator) WHERE postal_code LIKE '1%';
VACUUM categorised_phones ;
| QUERY PLAN                                                                                                                                    |
| :-------------------------------------------------------------------------------------------------------------------------------------------- |
| GroupAggregate  (cost=0.29..658.74 rows=3 width=12) (actual time=3.235..6.493 rows=2 loops=1)                                                 |
|   Group Key: operator                                                                                                                         |
|   ->  Index Only Scan using idx_1 on categorised_phones  (cost=0.29..554.10 rows=20921 width=4) (actual time=0.028..3.266 rows=22290 loops=1) |
|         Heap Fetches: 0                                                                                                                       |
| Planning time: 0.293 ms                                                                                                                       |
| Execution time: 6.517 ms                                                                                                                      |
like image 154
joanolo Avatar answered Sep 16 '25 09:09

joanolo