I have custom aggregate sum function which accepts boolean data type:
create or replace function badd (bigint, boolean)
returns bigint as
$body$
select $1 + case when $2 then 1 else 0 end;
$body$ language sql;
create aggregate sum(boolean) (
sfunc=badd,
stype=int8,
initcond='0'
);
This aggregate should calculate number of rows with TRUE. For example the following should return 2 (and it does):
with t (x) as
(values
(true::boolean),
(false::boolean),
(true::boolean),
(null::boolean)
)
select sum(x) from t;
However, it's performance is quite bad, it is 5.5 times slower then using casting to integer:
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector) from t; -- 52012ms
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector::int) from t; -- 9484ms
Is the only way how to improve this aggregate to write some new C function - e.g. some alternative of int2_sum function in src/backend/utils/adt/numeric.c?
Your test case is misleading, you only count TRUE. You should have both TRUE and FALSE - or even NULL, if applicable.
Like @foibs already explained, one wouldn't use a custom aggregate function for this. The built-in C-functions are much faster and do the job. Use instead (also demonstrating a simpler and more sensible test):
SELECT count(NULLIF(g%2 = 1, FALSE)) AS ct
FROM generate_series(1,100000,1) g;
How does this work?
Compute percents from SUM() in the same SELECT sql query
Several fast & simple ways (plus a benchmark) under this related answer on dba.SE:
For absolute performance, is SUM faster or COUNT?
Or faster yet, test for TRUE in the WHERE clause, where possible:
SELECT count(*) AS ct
FROM generate_series(1,100000,1) g;
WHERE g%2 = 1 -- excludes FALSE and NULL !
If you'd have to write a custom aggregate for some reason, this form would be superior:
CREATE OR REPLACE FUNCTION test_sum_int8 (int8, boolean)
RETURNS bigint as
'SELECT CASE WHEN $2 THEN $1 + 1 ELSE $1 END' LANGUAGE sql;
The addition is only executed when necessary. Your original would add 0 for the FALSE case.
Better yet, use a plpgsql function. It saves a bit of overhead per call, since it works like a prepared statement (the query is not re-planned). Makes a difference for a tiny aggregate function that is called many times:
CREATE OR REPLACE FUNCTION test_sum_plpgsql (int8, boolean)
RETURNS bigint AS
$func$
BEGIN
RETURN CASE WHEN $2 THEN $1 + 1 ELSE $1 END;
END
$func$ LANGUAGE plpgsql;
CREATE AGGREGATE test_sum_plpgsql(boolean) (
sfunc = test_sum_plpgsql
,stype = int8
,initcond = '0'
);
Faster than what you had, but much slower than the presented alternative with a standard count(). And slower than any other C-function, too.
->SQLfiddle
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With