Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select value of a column based on another column min(date) with Bigquery

Let say I have the following table :

AccountID   Email              status_update       date (dd/mm/yyyy) 
123456      [email protected]      state1              02/02/2016
123456      [email protected]      state2              10/010/2018
456123      [email protected]      state2              05/04/2017
789123      [email protected]   state2              22/04/2016
789123      [email protected]   state1              17/06/2018
456345      [email protected]     state1              13/08/2017
456345      [email protected]     state2              09/07/2015
456345      [email protected]     state2              09/07/2014

here is the output I want :

UniqueID    Email              Most_recent_status_Update  CountUniqueID                    
123456      [email protected]      state2                     2
456123      [email protected]      state2                     1    
789123      [email protected]   state1                     2              
456345      [email protected]     state1                     3

So basically I want to be able to group by Email, to do a unique(AccountID) and to select status_update based in MIN(date)

The difficult part to me is being able to select status_update based in MIN(date). I would like to be able to do something like this :

Select status_update when date = min(date)

I'm thinking about using CASE formula in order to have something like this :

Select CASE (WHEN date = min(date) Then status_update else null END

I always end up with all kind of error. I'm not able to build the whole query.

thanks

like image 730
Simon Breton Avatar asked Jan 18 '26 12:01

Simon Breton


1 Answers

See example below - for BigQuery Standard SQL

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 123456 accountID, '[email protected]' email, 'state1' status_update, '02/02/2016' dt UNION ALL
  SELECT 123456, '[email protected]', 'state2', '10/10/2018' UNION ALL
  SELECT 456123, '[email protected]', 'state2', '05/04/2017' UNION ALL
  SELECT 789123, '[email protected]', 'state2', '22/04/2016' UNION ALL
  SELECT 789123, '[email protected]', 'state1', '17/06/2018' UNION ALL
  SELECT 456345, '[email protected]', 'state1', '13/08/2017' UNION ALL
  SELECT 456345, '[email protected]', 'state2', '09/07/2015' UNION ALL
  SELECT 456345, '[email protected]', 'state2', '09/07/2014' 
)
SELECT 
  accountID, 
  email, 
  ARRAY_AGG(status_update ORDER BY PARSE_DATE('%d/%m/%Y', dt) DESC LIMIT 1)[OFFSET (0)] most_recent_status_update,
  COUNT(1) AS cnt
FROM `project.dataset.table`
GROUP BY accountID, email   

with result as

Row accountID   email               most_recent_status_update   cnt  
1   456123      [email protected]       state2                      1    
2   123456      [email protected]       state2                      2    
3   789123      [email protected]    state1                      2    
4   456345      [email protected]      state1                      3     

for your real use case you should use something like below

#standardSQL
SELECT 
  accountID, 
  email, 
  ARRAY_AGG(status_update ORDER BY PARSE_DATE('%d/%m/%Y', dt) DESC LIMIT 1)[OFFSET (0)] most_recent_status_update,
  COUNT(1) AS cnt
FROM `project.dataset.table`
GROUP BY accountID, email    

Note: this example assumes specific schema / data types (based on your example in question) - so if your real datatypes are different - you need to adjust things a little :o)

like image 85
Mikhail Berlyant Avatar answered Jan 20 '26 11:01

Mikhail Berlyant



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!