Slow query after big INSERT in Postgres

We are using Postgres 9.2 on RedHat. We have a table similar to the following:

CREATE TABLE BASE_BULK_WI (
    BULK_ID INTEGER NOT NULL,
    USER_ID VARCHAR(20) NOT NULL,
    CHUNK_ID INTEGER,
    STATE VARCHAR(16),
    CONSTRAINT BASE_BULK_WI_PK PRIMARY KEY(BULK_ID, USER_ID)
);
CREATE INDEX BASE_BULK_WI_IDX01 ON BASE_BULK_WI(STATE, CHUNK_ID);

As part of a batch job, we first add a number of rows to the table using a new BULK_ID. All new records have CHUNK_ID = NULL, STATE = 'PENDING'. The insert is between 500K and 1.5M rows. The size of the table when this happens is over 15M records.
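
To make the discussion concrete, this is roughly what the bulk load looks like (the staging table incoming_work and the literal bulk id are placeholders for illustration, not our real names):

INSERT INTO BASE_BULK_WI (BULK_ID, USER_ID, CHUNK_ID, STATE)
SELECT 42, i.USER_ID, NULL, 'PENDING'  -- every new row starts unassigned and pending
FROM   incoming_work i;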

After the insert, we start processing the table in chunks. To do this, we first select a number of items for the next chunk, then process them. Selecting the items happens with the following query:

UPDATE BASE_BULK_WI wi SET STATE = 'PROCESSING', CHUNK_ID = $1
WHERE wi.STATE = 'PENDING' AND wi.BULK_ID = $2
AND wi.USER_ID IN
    (SELECT USER_ID FROM BASE_BULK_WI WHERE BULK_ID = $3
     AND CHUNK_ID IS NULL AND STATE = 'PENDING' LIMIT $4 FOR UPDATE)

$1 increases with each chunk iteration, $2 and $3 are always the same (the BULK_ID just inserted), and $4 is usually between 2,000 and 10,000.

The problem is that the first few chunks take a long time to update. For example, with a limit of 2000, most updates complete in under 1 second, while the first few take over 2 minutes.

We are trying to understand why this happens and how to fix it. After reading the documentation:

To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content.

we believe it's related to the checkpoints and the WAL, but we've been unable to pin it down.
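
For reference, the checkpoint-related settings involved can be inspected like this (these parameter names exist in 9.2; we have not yet correlated their values with the slow chunks):

SELECT name, setting
FROM   pg_settings
WHERE  name IN ('checkpoint_segments', 'checkpoint_timeout',
                'checkpoint_completion_target', 'full_page_writes');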

Any suggestions?

asked by xpapad
1 Answer

ANALYZE

If you run the UPDATE immediately after a huge INSERT, make sure to run ANALYZE in between to update the statistics, or the query planner can make bad choices. The autovacuum daemon runs ANALYZE automatically, too, but it takes some time to kick in.
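
For example, a minimal sketch using the table from the question:

-- after the bulk INSERT, before the first chunk UPDATE
ANALYZE base_bulk_wi;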

FROM clause instead of IN

IN is notoriously slow with large subqueries. This might perform better:

UPDATE base_bulk_wi wi
SET    state = 'PROCESSING'
     , chunk_id = $1
FROM (
    SELECT user_id, bulk_id
    FROM   base_bulk_wi
    WHERE  bulk_id = $3
    AND    chunk_id IS NULL
    AND    state = 'PENDING'
    LIMIT  $4
    FOR    UPDATE
    ) x
WHERE wi.bulk_id = x.bulk_id
AND   wi.user_id = x.user_id;

Indexes

A partial index like this should be optimal for your case:

CREATE INDEX base_bulk_wi_partial_idx01 ON base_bulk_wi (bulk_id)
WHERE state = 'PENDING' AND chunk_id IS NULL;

For best performance, create this index after your INSERT, as sketched below. If the index already exists, it may pay to drop it before the INSERT and recreate it afterwards.
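
As a sketch, the sequence could look like this (statement order only; the details of your batch job will differ):

DROP INDEX IF EXISTS base_bulk_wi_partial_idx01;
-- ... run the bulk INSERT of the new BULK_ID here ...
CREATE INDEX base_bulk_wi_partial_idx01 ON base_bulk_wi (bulk_id)
WHERE state = 'PENDING' AND chunk_id IS NULL;
ANALYZE base_bulk_wi;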

One might think it would be a good idea to also include user_id in this index to allow index-only scans in Postgres 9.2. But since you have FOR UPDATE in the subquery, this is not an option anyway.

It would also help quite a bit if user_id were an integer instead of varchar (a foreign key to a users table). Besides faster handling and a smaller table, two integers fit perfectly into a minimum-size index, so your primary key would benefit as well.
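
A sketch of what that could look like (the users table and its columns are assumptions for illustration, not part of your existing schema):

CREATE TABLE users (
    user_id serial PRIMARY KEY,
    login   varchar(20) UNIQUE NOT NULL
);

CREATE TABLE base_bulk_wi (
    bulk_id  integer NOT NULL,
    user_id  integer NOT NULL REFERENCES users(user_id),
    chunk_id integer,
    state    varchar(16),
    CONSTRAINT base_bulk_wi_pk PRIMARY KEY (bulk_id, user_id)
);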

answered by Erwin Brandstetter