POSTGRES - Efficient SELECT or INSERT with multiple connections

Question

tl;dr: I'm trying to figure out the most efficient way to SELECT a record, or INSERT it if it doesn't already exist, that works correctly with multiple concurrent connections.

The situation: I'm constructing a Postgres database (9.3.5, x64) containing a whole bunch of information associated with customers. The database has a "customers" table with an "id" column (SERIAL PRIMARY KEY) and a "system_id" column (VARCHAR(64)). The "id" column is used as a foreign key in other tables to link to the customer. The "system_id" column must be unique if it is not null.

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    system_id VARCHAR(64),
    name VARCHAR(256));
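
Note that the CREATE TABLE statement above does not itself declare the uniqueness rule for system_id. A minimal sketch of how a UNIQUE constraint on that column would behave is shown here. It is an illustration under assumptions: it uses SQLite through Python's sqlite3 module as a stand-in so that it runs without a server, and INTEGER PRIMARY KEY replaces SERIAL. PostgreSQL treats NULLs in a UNIQUE column the same way (they never collide with each other), but its locking and error handling differ.

```python
import sqlite3

# Stand-in schema: SQLite's INTEGER PRIMARY KEY replaces Postgres' SERIAL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " system_id VARCHAR(64) UNIQUE,"  # assumption: constraint added for this sketch
    " name VARCHAR(256))"
)

# Several rows without a system_id are accepted: NULLs never collide.
conn.execute("INSERT INTO customers(system_id, name) VALUES (NULL, 'a')")
conn.execute("INSERT INTO customers(system_id, name) VALUES (NULL, 'b')")

# A non-null system_id can be stored only once.
conn.execute("INSERT INTO customers(system_id, name) VALUES ('sys-1', 'c')")
try:
    conn.execute("INSERT INTO customers(system_id, name) VALUES ('sys-1', 'd')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

With such a constraint in place, the database itself rejects a second row with the same system_id, whichever connection tries to insert it.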

An example of a table that references the id in the customers table:

CREATE TABLE tsrs (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    filename VARCHAR(256) NOT NULL,
    name VARCHAR(256),
    timestamp TIMESTAMP WITHOUT TIME ZONE);

I have written a Python script that uses the multiprocessing module to push data into the database through multiple connections (one per process).

The first thing each process needs to do when pushing data into the database is to check whether a customer with a particular system_id is already in the customers table. If it is, the associated customers.id is cached. If it's not already in the table, a new row is added, and the resulting customers.id is cached. I have written an SQL function to do this for me:

CREATE OR REPLACE FUNCTION get_or_insert_customer(p_system_id customers.system_id%TYPE, p_name customers.name%TYPE) RETURNS customers.id%TYPE AS $$
DECLARE
    v_id customers.id%TYPE;
BEGIN
    LOCK TABLE customers IN EXCLUSIVE MODE;
    SELECT id INTO v_id FROM customers WHERE system_id=p_system_id;
    IF v_id is NULL THEN
        INSERT INTO customers(system_id, name)
            VALUES(p_system_id,p_name)
            RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
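
The check-and-cache step on the Python side could look roughly like the sketch below. It is an illustration, not the script from the question: CustomerCache and get_id are hypothetical names, and the %s placeholders assume a psycopg2-style DB-API cursor connected to the database.

```python
class CustomerCache:
    """Per-process cache mapping system_id -> customers.id (hypothetical helper)."""

    def __init__(self, cursor):
        # `cursor` is assumed to be a DB-API cursor on the Postgres connection
        # owned by this process; psycopg2 uses the %s placeholder style.
        self._cursor = cursor
        self._ids = {}

    def get_id(self, system_id, name):
        # Only hit the database the first time a system_id is seen.
        if system_id not in self._ids:
            self._cursor.execute(
                "SELECT get_or_insert_customer(%s, %s)", (system_id, name)
            )
            self._ids[system_id] = self._cursor.fetchone()[0]
        return self._ids[system_id]
```

Because a lock taken with LOCK TABLE is held until the surrounding transaction ends, committing soon after the call keeps other processes from waiting longer than necessary.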

The problem: The table locking was the only way I was able to prevent duplicate system_ids being added to the table by concurrent processes. This isn't really ideal as it effectively serialises all the processing at this point, and basically doubles the amount of time that it takes to push a given amount of data into the db.

I wanted to ask whether there is a more efficient/elegant way to implement the "SELECT or INSERT" mechanism that wouldn't cause as much of a slowdown. I suspect that there isn't, but figured it was worth asking, just in case.

Many thanks for reading this far. Any advice is much appreciated!

wildplasser · Accepted Answer

I managed to rewrite the function in plain SQL, changing the order of the statements (avoiding the IF and the potential race condition):

CREATE OR REPLACE FUNCTION get_or_insert_customer
        ( p_system_id customers.system_id%TYPE
        , p_name customers.name%TYPE
        )  RETURNS customers.id%TYPE AS $func$
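
    -- Block concurrent writers until the surrounding transaction ends.
    LOCK TABLE customers IN EXCLUSIVE MODE;

    -- Insert only when no row with this system_id exists yet.
    INSERT INTO customers(system_id, name)
    SELECT p_system_id, p_name
     WHERE NOT EXISTS (SELECT 1 FROM customers WHERE system_id = p_system_id)
        ;

    -- Return the id of the row with this system_id (pre-existing or just inserted).
    SELECT id
        FROM customers WHERE system_id = p_system_id
        ;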
$func$ LANGUAGE sql;
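
The statement order used here (insert if absent, then select) can be tried out without a Postgres server. The following is a minimal sketch under assumptions: it uses SQLite through Python's sqlite3 module as a stand-in, with INTEGER PRIMARY KEY in place of SERIAL, and a Python function of the same name in place of the SQL function. SQLite has no LOCK TABLE and a different concurrency model, so this only illustrates the single-connection logic, not the locking.

```python
import sqlite3


def get_or_insert_customer(conn, system_id, name):
    """Insert the customer only if system_id is absent, then return its id."""
    conn.execute(
        "INSERT INTO customers(system_id, name) "
        "SELECT ?, ? "
        "WHERE NOT EXISTS (SELECT 1 FROM customers WHERE system_id = ?)",
        (system_id, name, system_id),
    )
    row = conn.execute(
        "SELECT id FROM customers WHERE system_id = ?", (system_id,)
    ).fetchone()
    return row[0]


# Stand-in schema mirroring the customers table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " system_id VARCHAR(64),"
    " name VARCHAR(256))"
)

first = get_or_insert_customer(conn, "sys-1", "Alice")
second = get_or_insert_customer(conn, "sys-1", "Alice")  # no second row is created
other = get_or_insert_customer(conn, "sys-2", "Bob")
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Calling the function twice with the same system_id yields the same id, and only one row per system_id ends up in the table.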

Tags: python, sql, database, postgresql, concurrency

JBeFat
