Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LIMIT per group - Google BigQuery/Standard SQL

I have a table like the following (example here):

CREATE TABLE topics (
  name varchar(64),
  url varchar(253),
  statistic integer,
  pubdate timestamp
);

INSERT INTO topics VALUES
('a',  'b',  100,  TIMESTAMP '2011-05-16 15:36:38'),  
('a',  'c',  110,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'd',  120,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'e',  90,   TIMESTAMP '2011-05-16 15:36:38'), 
('a',  'f',  80,   TIMESTAMP '2014-04-01 00:00:00'), 
('a',  'g',  70,   TIMESTAMP '2011-05-16 15:36:38'), 
('a',  'h',  150,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'i',  50,   TIMESTAMP '2011-05-16 15:36:38'), 
('b',  'j',  10,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'k',  11,   TIMESTAMP '2011-05-16 15:36:38'), 
('b',  'l',  12,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'm',  9,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'n',  8,    TIMESTAMP '2014-04-01 00:00:00'),
('b',  'o',  7,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'p',  15,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'q',  5,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'r',  2,    TIMESTAMP '2014-04-01 00:00:00')

I would like to take the top two rows based on their statistic value _from each (name, date(pubdate)) combination.

In other words, I would like to GROUP BY name, date(pubdate), but without an aggregate function, and instead with the result simply taking the top two rows based on their statistic from each group. (So, I know it is not really a GROUP BY, but rather a greatest-n-per-group.)

I'm using Google Big Query with Standard SQL. I've looked at a number of other solutions but am unsure how to achieve the result in this case.

Desired result:

name    url     statistic   date

a       b       100         2011-05-16
a       e       90          2011-05-16

a       h       150         2014-04-01
a       d       120         2014-04-01

b       m       9           2011-05-16
b       k       11          2011-05-16

b       l       12          2014-04-01
b       p       15          2014-04-01
like image 200
Brad Solomon Avatar asked Mar 22 '26 21:03

Brad Solomon


2 Answers

Below is for BigQuery Standard SQL

#standardSQL
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) day, 
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
  FROM `project.dataset.table`   
  GROUP BY name, day
), UNNEST(arr)
-- ORDER BY name, day  

You can test, play with above using sample data in your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' name, 'b' url,  100 statistic,  TIMESTAMP '2011-05-16 15:36:38' pubdate UNION ALL  
  SELECT 'a', 'c',  110,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'd',  120,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'e',  90,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'a', 'f',  80,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'a', 'g',  70,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'a', 'h',  150,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'i',  50,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'b', 'j',  10,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'k',  11,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'b', 'l',  12,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'm',  9,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'n',  8,    '2014-04-01 00:00:00' UNION ALL
  SELECT 'b', 'o',  7,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'p',  15,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'q',  5,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'r',  2,    '2014-04-01 00:00:00' 
)
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) day, 
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
  FROM `project.dataset.table`  
  GROUP BY name, day
), UNNEST(arr)
ORDER BY name, day   

with result

Row name    day         url statistic    
1   a       2011-05-16  b   100  
2   a       2011-05-16  e   90   
3   a       2014-04-01  h   150  
4   a       2014-04-01  d   120  
5   b       2011-05-16  k   11   
6   b       2011-05-16  m   9    
7   b       2014-04-01  p   15   
8   b       2014-04-01  l   12   
like image 127
Mikhail Berlyant Avatar answered Mar 24 '26 11:03

Mikhail Berlyant


Use the ARRAY_AGG function:

SELECT
  name,
  DATE(pubdate) AS pubdate,
  ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
FROM dataset.table
GROUP BY name, pubdate

You can use a subquery with UNNEST to get rows as output without arrays:

SELECT name, pubdate, url, statistic
FROM (
  SELECT
    name,
    DATE(pubdate) AS pubdate,
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
  FROM dataset.table
  GROUP BY name, pubdate
), UNNEST(top_urls)
like image 26
Elliott Brossard Avatar answered Mar 24 '26 11:03

Elliott Brossard



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!