Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improve performance of custom aggregate function in PostgreSQL

I have custom aggregate sum function which accepts boolean data type:

create or replace function badd (bigint, boolean)
  returns bigint as
$body$
select $1 + case when $2 then 1 else 0 end;
$body$ language sql;

create aggregate sum(boolean) (
  sfunc=badd,
  stype=int8,
  initcond='0'
);

This aggregate should calculate number of rows with TRUE. For example the following should return 2 (and it does):

with t (x) as 
   (values 
      (true::boolean), 
      (false::boolean), 
      (true::boolean),
      (null::boolean)      
      )
select sum(x) from t;

However, it's performance is quite bad, it is 5.5 times slower then using casting to integer:

with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector) from t; -- 52012ms

with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector::int) from t; -- 9484ms

Is the only way how to improve this aggregate to write some new C function - e.g. some alternative of int2_sum function in src/backend/utils/adt/numeric.c?

like image 351
Tomas Greif Avatar asked Oct 23 '25 07:10

Tomas Greif


1 Answers

Your test case is misleading, you only count TRUE. You should have both TRUE and FALSE - or even NULL, if applicable.

Like @foibs already explained, one wouldn't use a custom aggregate function for this. The built-in C-functions are much faster and do the job. Use instead (also demonstrating a simpler and more sensible test):

SELECT count(NULLIF(g%2 = 1, FALSE)) AS ct
FROM  generate_series(1,100000,1) g;

How does this work?
Compute percents from SUM() in the same SELECT sql query

Several fast & simple ways (plus a benchmark) under this related answer on dba.SE:
For absolute performance, is SUM faster or COUNT?

Or faster yet, test for TRUE in the WHERE clause, where possible:

SELECT count(*) AS ct
FROM   generate_series(1,100000,1) g;
WHERE  g%2 = 1             -- excludes FALSE and NULL !

If you'd have to write a custom aggregate for some reason, this form would be superior:

CREATE OR REPLACE FUNCTION test_sum_int8 (int8, boolean)
  RETURNS bigint as
'SELECT CASE WHEN $2 THEN $1 + 1 ELSE $1 END' LANGUAGE sql;

The addition is only executed when necessary. Your original would add 0 for the FALSE case.

Better yet, use a plpgsql function. It saves a bit of overhead per call, since it works like a prepared statement (the query is not re-planned). Makes a difference for a tiny aggregate function that is called many times:

CREATE OR REPLACE FUNCTION test_sum_plpgsql (int8, boolean)
  RETURNS bigint AS
$func$
BEGIN
RETURN CASE WHEN $2 THEN $1 + 1 ELSE $1 END;
END
$func$ LANGUAGE plpgsql;

CREATE AGGREGATE test_sum_plpgsql(boolean) (
  sfunc = test_sum_plpgsql
 ,stype = int8
 ,initcond = '0'
);

Faster than what you had, but much slower than the presented alternative with a standard count(). And slower than any other C-function, too.

->SQLfiddle

like image 68
Erwin Brandstetter Avatar answered Oct 25 '25 20:10

Erwin Brandstetter



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!