Distinct Count across Bigquery arrays

I want to concatenate arrays across rows and then do a distinct count. Ideally, this would work:

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)
SELECT
  SUM(value) as total_value,
  ARRAY_LENGTH(ARRAY_CONCAT_AGG(DISTINCT key)) as unique_key_count
FROM test

Unfortunately, the ARRAY_CONCAT_AGG function doesn't support the DISTINCT operator. I can unnest the array but then I get a fanout and the sum of the value column is wrong:

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  COUNT(DISTINCT k) as unique_key_count

FROM test
  CROSS JOIN UNNEST(key) k


Is there anything I'm missing that would allow me to avoid joining in the unnested array?

asked Jan 29 '26 by alan

2 Answers

Here is an alternative:

CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
  (SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  DistinctCount(ARRAY_CONCAT_AGG(key)) as unique_key_count
FROM test

This avoids joining the unnested array against the table, which would duplicate values in the sum; the distinct-count logic stays inside the UDF instead of the main query.
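For a one-off query, the same idea can be inlined as an expression subquery over the aggregated array, without declaring a temporary function. A sketch of the equivalent query:

```sql
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) AS total_value,
  -- count distinct elements of the concatenated arrays in a scalar subquery,
  -- so the outer query never fans out and the SUM stays correct
  (SELECT COUNT(DISTINCT k) FROM UNNEST(ARRAY_CONCAT_AGG(key)) AS k) AS unique_key_count
FROM test
```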

answered Jan 31 '26 by Elliott Brossard

Below is for BigQuery Standard SQL

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT 
  total_value,
  COUNT(DISTINCT key) unique_key_count
FROM (
  SELECT
    SUM(value) AS total_value,
    ARRAY_CONCAT_AGG(key) AS all_keys
  FROM test
), UNNEST(all_keys) key
GROUP BY total_value  

Result:

Row total_value unique_key_count     
1   5           5     

If your table has a large number of rows, you can easily run into memory/resource issues. In that case, try the HyperLogLog++ functions for approximate aggregation; see the example below.

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) total_value,
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key) FROM UNNEST(key) key)) AS unique_key_count
FROM test

with the result:

Row total_value unique_key_count     
1   5           5

Note: this is an approximate aggregation, so pay attention to the precision parameter of the HLL_COUNT.INIT(input [, precision]) function.
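If accuracy matters more than memory, the precision can be raised explicitly; per the BigQuery documentation it ranges from 10 to 24, with a default of 15. A sketch reusing the test CTE above, with the maximum precision:

```sql
SELECT
  SUM(value) AS total_value,
  -- precision 24 uses more memory per sketch but tightens the estimate
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key, 24) FROM UNNEST(key) key)) AS unique_key_count
FROM test
```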

answered Jan 31 '26 by Mikhail Berlyant