I want to concatenate arrays across rows and then do a distinct count. Ideally, this would work:
WITH test AS
(
SELECT
DATE('2018-01-01') as date,
2 as value,
[1,2,3] as key
UNION ALL
SELECT
DATE('2018-01-02') as date,
3 as value,
[1,4,5] as key
)
SELECT
SUM(value) as total_value,
ARRAY_LENGTH(ARRAY_CONCAT_AGG(DISTINCT key)) as unique_key_count
FROM test
Unfortunately, the ARRAY_CONCAT_AGG function doesn't support the DISTINCT operator. I can unnest the array but then I get a fanout and the sum of the value column is wrong:
WITH test AS
(
SELECT
DATE('2018-01-01') as date,
2 as value,
[1,2,3] as key
UNION ALL
SELECT
DATE('2018-01-02') as date,
3 as value,
[1,4,5] as key
)
SELECT
SUM(value) as total_value,
COUNT(DISTINCT k) as unique_key_count
FROM test
CROSS JOIN UNNEST(key) k

Is there anything I'm missing that would allow me to avoid joining in the unnested array?
Here is an alternative:
CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
(SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);
WITH test AS
(
SELECT
DATE('2018-01-01') as date,
2 as value,
[1,2,3] as key
UNION ALL
SELECT
DATE('2018-01-02') as date,
3 as value,
[1,4,5] as key
)
SELECT
SUM(value) as total_value,
DistinctCount(ARRAY_CONCAT_AGG(key)) as unique_key_count
FROM test
This avoids having a subquery or needing to join the array with the table (causing duplicate values in the sum).
Below is for BigQuery Standard SQL
#standardSQL
WITH test AS
(
SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
total_value,
COUNT(DISTINCT key) unique_key_count
FROM (
SELECT
SUM(value) AS total_value,
ARRAY_CONCAT_AGG(key) AS all_keys
FROM test
), UNNEST(all_keys) key
GROUP BY total_value
result :
Row total_value unique_key_count
1 5 5
In case you you have quite a number of rows in your table - you can easily get to memory/resources issue - in this case you can try using HyperLogLog++ Functions for approximate aggregation - see example below
#standardSQL
WITH test AS
(
SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
SUM(value) total_value,
HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key) FROM UNNEST(key) key)) AS unique_key_count
FROM test
with result
Row total_value unique_key_count
1 5 5
Note: this is approximate aggregations - so pay attention to precision parameter in HLL_COUNT.INIT(input [, precision]) function
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With