When storing a tsvector value in a column, for records with no search terms, should I store an empty tsvector or a NULL value?
Does it matter?
Is there any difference in terms of performance, or storage overhead from storing empty vectors?
In other words, when updating the vector based on a value of, say, a nullable title column, do I need to always compute this as to_tsvector(coalesce(title,'')) (since to_tsvector returns NULL when given a NULL argument) or is it enough to do to_tsvector(title)?
First off, the semantics of SQL NULL is that of UNKNOWN, whereas some data types also have an "empty" value. Those data types include:
TEXT ('' isn't the same as NULL::TEXT)JSON and JSONB ([] or {} aren't the same as NULL::JSON or NULL:JSONB)X[] (ARRAY[]::X[] isn't the same as NULL::X[])There are many more, including TSVECTOR. The semantics of an empty collection of something is always subtly different from that of a NULL value, which is an UNKNOWN collection (often just used as an absent collection, though). The distiction manifests specifically when it comes to using operators, e.g.
'' || 'abc' = 'abc' but NULL || 'abc' IS NULLto_tsvector('cats ate rats') @@ to_tsquery('cat & rat') = true but NULL @@ to_tsquery('cat & rat') IS NULLIn that sense, the decision should be foremost a logic one, not a storage one, based on this question: Will you still work with the TSVECTOR value of a record, even if the record doesn't have any search terms (pro empty TSVECTOR)? Or does the feature not apply at all to that particular record (pro NULL value)? For the @@ operator, it may not be so relevant, but it definitely is for the || operator, and others.
The answer isn't obvious, nor is there a clear right / wrong way in general.
If this is a highly performance sensitive situation in your application (e.g. you have a lot of empty TSVECTOR values), then maybe, this benchmark could help you with the decision?
I ran the below benchmark on PostgreSQL 14.1 in Docker to get this result:
RUN 1, Statement 1: 2.91145
RUN 1, Statement 2: 1.00000 -- The fastest run is 1. The others are multiples of 1
RUN 2, Statement 1: 2.80509
RUN 2, Statement 2: 1.05232
RUN 3, Statement 1: 2.78001
RUN 3, Statement 2: 1.00202
RUN 4, Statement 1: 2.74319
RUN 4, Statement 2: 1.00524
RUN 5, Statement 1: 2.75808
RUN 5, Statement 2: 1.00045
SELECT v @@ to_tsquery('cat & rat') with v tsvector = to_tsvector('');SELECT NULL @@ to_tsquery('cat & rat')The fact that NULL is involved probably leads to a shortcut in the @@ operator's algorithm, which produces a 2.7x performance improvement over querying an empty TSVECTOR in the benchmark. So, there do seem to be benefits of using NULL in terms of performance.
Obviously, that's just a benchmark, which doesn't necessarily reflect real-world use-cases, but it should give you a hint of a potential difference.
Benchmark code
For reproduction or adaptations, here's a benchmark, based on this technique.
DO $$
DECLARE
v_ts TIMESTAMP;
v_repeat CONSTANT INT := 10000;
rec RECORD;
run INT[];
stmt INT[];
elapsed DECIMAL[];
min_elapsed DECIMAL;
i INT := 1;
-- Store the vector in a local variable to avoid re-computing it in the benchmark
v tsvector = to_tsvector('');
BEGIN
-- Repeat the whole benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
-- Statement 1
SELECT v @@ to_tsquery('cat & rat')
) LOOP
NULL;
END LOOP;
END LOOP;
run[i] := r;
stmt[i] := 1;
elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
- EXTRACT(EPOCH FROM v_ts));
i := i + 1;
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
-- Statement 2
SELECT NULL @@ to_tsquery('cat & rat')
) LOOP
NULL;
END LOOP;
END LOOP;
run[i] := r;
stmt[i] := 2;
elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
- EXTRACT(EPOCH FROM v_ts));
i := i + 1;
END LOOP;
SELECT min(t.elapsed)
INTO min_elapsed
FROM unnest(elapsed) AS t(elapsed);
FOR i IN 1..array_length(run, 1) LOOP
RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i],
CAST(elapsed[i] / min_elapsed AS DECIMAL(10, 5));
END LOOP;
END$$;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With