I have a table like the following (example here):
CREATE TABLE topics (
name varchar(64),
url varchar(253),
statistic integer,
pubdate timestamp
);
INSERT INTO topics VALUES
('a', 'b', 100, TIMESTAMP '2011-05-16 15:36:38'),
('a', 'c', 110, TIMESTAMP '2014-04-01 00:00:00'),
('a', 'd', 120, TIMESTAMP '2014-04-01 00:00:00'),
('a', 'e', 90, TIMESTAMP '2011-05-16 15:36:38'),
('a', 'f', 80, TIMESTAMP '2014-04-01 00:00:00'),
('a', 'g', 70, TIMESTAMP '2011-05-16 15:36:38'),
('a', 'h', 150, TIMESTAMP '2014-04-01 00:00:00'),
('a', 'i', 50, TIMESTAMP '2011-05-16 15:36:38'),
('b', 'j', 10, TIMESTAMP '2014-04-01 00:00:00'),
('b', 'k', 11, TIMESTAMP '2011-05-16 15:36:38'),
('b', 'l', 12, TIMESTAMP '2014-04-01 00:00:00'),
('b', 'm', 9, TIMESTAMP '2011-05-16 15:36:38'),
('b', 'n', 8, TIMESTAMP '2014-04-01 00:00:00'),
('b', 'o', 7, TIMESTAMP '2011-05-16 15:36:38'),
('b', 'p', 15, TIMESTAMP '2014-04-01 00:00:00'),
('b', 'q', 5, TIMESTAMP '2011-05-16 15:36:38'),
('b', 'r', 2, TIMESTAMP '2014-04-01 00:00:00')
I would like to take the top two rows based on their statistic value _from each (name, date(pubdate)) combination.
In other words, I would like to GROUP BY name, date(pubdate), but without an aggregate function, and instead with the result simply taking the top two rows based on their statistic from each group. (So, I know it is not really a GROUP BY, but rather a greatest-n-per-group.)
I'm using Google Big Query with Standard SQL. I've looked at a number of other solutions but am unsure how to achieve the result in this case.
Desired result:
name url statistic date
a b 100 2011-05-16
a e 90 2011-05-16
a h 150 2014-04-01
a d 120 2014-04-01
b m 9 2011-05-16
b k 11 2011-05-16
b l 12 2014-04-01
b p 15 2014-04-01
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(arr) FROM (
SELECT name, DATE(pubdate) day,
ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
FROM `project.dataset.table`
GROUP BY name, day
), UNNEST(arr)
-- ORDER BY name, day
You can test, play with above using sample data in your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' name, 'b' url, 100 statistic, TIMESTAMP '2011-05-16 15:36:38' pubdate UNION ALL
SELECT 'a', 'c', 110, '2014-04-01 00:00:00' UNION ALL
SELECT 'a', 'd', 120, '2014-04-01 00:00:00' UNION ALL
SELECT 'a', 'e', 90, '2011-05-16 15:36:38' UNION ALL
SELECT 'a', 'f', 80, '2014-04-01 00:00:00' UNION ALL
SELECT 'a', 'g', 70, '2011-05-16 15:36:38' UNION ALL
SELECT 'a', 'h', 150, '2014-04-01 00:00:00' UNION ALL
SELECT 'a', 'i', 50, '2011-05-16 15:36:38' UNION ALL
SELECT 'b', 'j', 10, '2014-04-01 00:00:00' UNION ALL
SELECT 'b', 'k', 11, '2011-05-16 15:36:38' UNION ALL
SELECT 'b', 'l', 12, '2014-04-01 00:00:00' UNION ALL
SELECT 'b', 'm', 9, '2011-05-16 15:36:38' UNION ALL
SELECT 'b', 'n', 8, '2014-04-01 00:00:00' UNION ALL
SELECT 'b', 'o', 7, '2011-05-16 15:36:38' UNION ALL
SELECT 'b', 'p', 15, '2014-04-01 00:00:00' UNION ALL
SELECT 'b', 'q', 5, '2011-05-16 15:36:38' UNION ALL
SELECT 'b', 'r', 2, '2014-04-01 00:00:00'
)
SELECT * EXCEPT(arr) FROM (
SELECT name, DATE(pubdate) day,
ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
FROM `project.dataset.table`
GROUP BY name, day
), UNNEST(arr)
ORDER BY name, day
with result
Row name day url statistic
1 a 2011-05-16 b 100
2 a 2011-05-16 e 90
3 a 2014-04-01 h 150
4 a 2014-04-01 d 120
5 b 2011-05-16 k 11
6 b 2011-05-16 m 9
7 b 2014-04-01 p 15
8 b 2014-04-01 l 12
Use the ARRAY_AGG function:
SELECT
name,
DATE(pubdate) AS pubdate,
ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
FROM dataset.table
GROUP BY name, pubdate
You can use a subquery with UNNEST to get rows as output without arrays:
SELECT name, pubdate, url, statistic
FROM (
SELECT
name,
DATE(pubdate) AS pubdate,
ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
FROM dataset.table
GROUP BY name, pubdate
), UNNEST(top_urls)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With