
Using sum(case when...) over partition of woy and hh24

Tags: sql, postgresql

Using PostgreSQL version 9.4.18.

The query below returns unexpected results for nonzero_year_count and percent_years_count_not_zero.

Table data: note that in this sample any date from 06-25 onward falls in week of year 26, so woyhh 2607 occurs in each of the three years (2016, 2017, and 2018). The sqlfiddle test database holds only a small sample (2016-2018) of the database I'm working with, using the following table. http://sqlfiddle.com/#!17/f6251/1

CREATE TABLE ltg_data
("time" timestamp with time zone)
/

INSERT INTO ltg_data
("time")
VALUES
('2018-06-23 07:19:00'),
('2018-06-24 07:19:00'),
('2018-06-25 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-24 07:19:00'),
('2018-06-25 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-24 07:19:00'),
('2018-06-25 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-24 07:19:00'),
('2018-06-25 07:19:00'),
('2018-06-26 07:19:00'),
('2018-06-25 17:19:00'),
('2018-06-25 17:19:00'),
('2017-06-25 19:19:00'),
('2017-06-25 20:19:00'),
('2017-06-26 07:19:00'),
('2017-06-26 07:19:00'),
('2017-06-24 07:19:00'),
('2017-06-24 07:19:00'),
('2017-06-23 21:19:00'),
('2017-06-23 21:19:00'),
('2017-06-24 07:19:00'),
('2016-06-26 07:19:00'),
('2016-06-25 07:19:00'),
('2016-06-25 07:19:00'),
('2016-06-27 07:19:00'),
('2016-06-26 07:19:00'),
('2016-06-26 07:19:00')

The following query should return some basic statistics about the table data. The challenge, I think, is partitioning over week-of-year and hour while still accounting for the year. The errant output comes from the part of the query that tries to determine the number of years in which a week-of-year and hour (woyhh) has a count > 0. Below are the query and the functions it uses (the woy functions adjust for leap years so that weeks line up from year to year). I'm using generate_series because I want a full year's worth of woyhh values even if a given woyhh has no counts.

Functions: (likely not critical but wanted to include in case I'm missing something and they are indeed relevant to the problem)

create or replace function IsLeapYear(int)
returns boolean as $$
select $1 % 4 = 0 and ($1 % 100 <> 0 or $1 % 400 = 0)
$$ LANGUAGE sql IMMUTABLE STRICT; 

create or replace function f_woyhh(timestamp with time zone)
returns int language plpgsql as $$
declare
    currentYear int = extract(year from $1);
    -- shift by one extra day after Feb 28 in leap years so weeks line up across years
    LeapYearShift int = 1 + (IsLeapYear(currentYear) and $1 > make_date(currentYear, 2, 28))::int;
begin
    -- 1-based week of year concatenated with the two-digit hour, e.g. 2607
    return CONCAT(((extract(doy from $1)::int) - LeapYearShift) / 7 + 1, to_char($1, 'HH24'));
end;
$$;

Query:

WITH
CTE_Dates AS
(
    SELECT
        f_woyhh(d) AS dt
        ,EXTRACT(YEAR FROM d::timestamp) AS dtYear
    FROM
        generate_series(timestamp '2016-01-01', timestamp '2018-12-31', interval '1 hour') AS d
        -- full range of possible dates
)
,CTE_WeeklyHourlyCounts AS
(
    SELECT
        f_woyhh(time) AS dt
        ,time
        ,count(*) AS ct
    FROM
        ltg_data
    GROUP BY ltg_data.time
)
,CTE_FullStats AS
(
    SELECT
        CTE_Dates.dt AS woyhh
        ,COUNT(DISTINCT CTE_Dates.dtYear) AS years_count
        ,SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CTE_Dates.dt) AS nonzero_year_count
        ,100.0 * SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CTE_Dates.dt)
            / COUNT(DISTINCT CTE_Dates.dtYear) AS percent_years_count_not_zero
    FROM
        CTE_Dates
        LEFT JOIN CTE_WeeklyHourlyCounts ON CTE_WeeklyHourlyCounts.dt = CTE_Dates.dt
    GROUP BY CTE_Dates.dt, CTE_WeeklyHourlyCounts.ct, CTE_WeeklyHourlyCounts.dt
)
SELECT
    woyhh
    ,nonzero_year_count
    ,years_count
    ,percent_years_count_not_zero
FROM
    CTE_FullStats
WHERE woyhh::text LIKE '26%'
GROUP BY woyhh, years_count, nonzero_year_count, percent_years_count_not_zero
ORDER BY woyhh

Portion of the undesired results:

woyhh | nonzero_year_count | years_count| percent_years_count_not_zero
2605  | 0                  | 3          | 0
2606  | 0                  | 3          | 0
2607  | 5                  | 3          | 166.66
2608  | 0                  | 3          | 0
2609  | 0                  | 3          | 0

For woyhh 2607, two parts of the results are wrong. nonzero_year_count should be 3, because there are only 3 years of data and each of them has a count in week 26, hour 07 (any day after the 24th of the month falls in week 26). Likewise, percent_years_count_not_zero should be 100%, not 166%: 100% is the maximum, meaning a count for that woyhh occurred in every year. The value may be lower, but it should never be higher.

Desired results:

woyhh | nonzero_year_count | years_count| percent_years_count_not_zero
2605  | 0                  | 3          | 0
2606  | 0                  | 3          | 0
2607  | 3                  | 3          | 100
2608  | 0                  | 3          | 0
2609  | 0                  | 3          | 0

So I think the main problem lies with this portion of the query:

,SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER  (PARTITION BY CTE_Dates.dt) AS nonzero_year_count

I'm partitioning over woyhh, but this isn't sufficient because I need to factor in the years. It's as if I need to group by year somehow, to determine whether a woyhh has occurred in a given year and then count it once and no more for that year. I've tried incorporating the year but have run into even stranger results.

I hope this clarifies my question some. I've included an updated sqlfiddle below to replicate the data/query I'm using for the test table. Thanks for any help!

http://sqlfiddle.com/#!17/f6251/1

user1610717 asked Nov 22 '25 13:11



1 Answer

Your CTE_WeeklyHourlyCounts definition is not fit for purpose: with the GROUP BY ltg_data.time part, there will be five records that fit the bill:

2607;"2016-06-26 07:19:00+02";3
2607;"2016-06-27 07:19:00+02";1
2607;"2017-06-26 07:19:00+02";2
2607;"2018-06-25 07:19:00+02";4
2607;"2018-06-26 07:19:00+02";7

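Because the outer GROUP BY in CTE_FullStats includes ct, each of these rows with a distinct count stays a separate row, and the window SUM over woyhh counts all five, so nonzero_year_count ends up as 5. With the definition below, the counts are grouped by woyhh and year (ctYear), which leaves one row per year and produces the desired result.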

,CTE_WeeklyHourlyCounts
AS
(
SELECT
f_woyhh(time) as dt
    ,EXTRACT(YEAR FROM time) AS ctYear
    ,count(*) AS ct
FROM
    ltg_data
    GROUP BY dt,EXTRACT(YEAR FROM time) 
)
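
One caveat with the replacement above: the outer query still groups by ct, so two years that happen to have the same count for a woyhh would collapse into a single row and be counted once. A possible way around that, offered as a sketch rather than a tested drop-in, is to join on both woyhh and year and aggregate once. It assumes f_woyhh from the question; the names cte_dates, cte_counts, dt_year and ct_year are my own.

```sql
WITH cte_dates AS (
    -- one row per (woyhh, year) over the full range of possible dates
    SELECT DISTINCT
        f_woyhh(d) AS dt
        ,EXTRACT(YEAR FROM d)::int AS dt_year
    FROM generate_series(timestamptz '2016-01-01', timestamptz '2018-12-31', interval '1 hour') AS d
)
,cte_counts AS (
    -- one row per (woyhh, year) that actually has data
    SELECT
        f_woyhh("time") AS dt
        ,EXTRACT(YEAR FROM "time")::int AS ct_year
        ,count(*) AS ct
    FROM ltg_data
    GROUP BY 1, 2
)
SELECT
    d.dt AS woyhh
    ,count(c.ct) AS nonzero_year_count   -- counts only years with a matching row
    ,count(*) AS years_count             -- every year in which this woyhh exists
    ,round(100.0 * count(c.ct) / count(*), 2) AS percent_years_count_not_zero
FROM cte_dates d
LEFT JOIN cte_counts c
    ON c.dt = d.dt AND c.ct_year = d.dt_year
WHERE d.dt BETWEEN 2600 AND 2623         -- week 26, hours 00 to 23
GROUP BY d.dt
ORDER BY d.dt;
```

Since each (woyhh, year) pair matches at most one row in cte_counts, nonzero_year_count can never exceed years_count, so the percentage stays at or below 100 without needing a window function.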

As an aside - when an SQL statement that uses WITH clauses or embedded SELECTs doesn't work out as expected, a first step towards solving the issue can be to check the result of those WITH clauses or embedded SELECTs.

To find this particular issue, I executed the following to verify what was going on:

SELECT  f_woyhh(d) as dt,
    EXTRACT(YEAR FROM d::timestamp) AS dtYear 
FROM generate_series(timestamp '2016-01-01', timestamp '2018-12-31', interval '1 hour') as d
WHERE f_woyhh(d) between 2605 and 2608; -- the WHERE clause to just limit the result.

Followed by

WITH
CTE_Dates
AS
(
SELECT  f_woyhh(d) as dt,
    EXTRACT(YEAR FROM d::timestamp) AS dtYear from
        generate_series(timestamp '2016-01-01', timestamp '2018-12-31', interval '1 hour') as d
    -- full range of possible dates
)

SELECT
f_woyhh(time) as dt
    ,time
    ,count(*) AS ct
FROM
    ltg_data
    GROUP BY ltg_data.time
    ORDER BY dt, ltg_data.time;

Which highlighted the problem as there are 5 records (shown above) for 2607.
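
A compact variant of the same check, again only a sketch with column names of my own choosing, compares for each woyhh how many rows the two grouping strategies would produce:

```sql
-- For each woyhh, compare the row counts produced by GROUP BY time
-- and by GROUP BY year.
SELECT
    f_woyhh("time") AS dt
    ,count(DISTINCT "time") AS rows_grouped_by_time
    ,count(DISTINCT EXTRACT(YEAR FROM "time")) AS rows_grouped_by_year
FROM
    ltg_data
GROUP BY 1
ORDER BY dt;
```

Any woyhh where rows_grouped_by_time is larger than rows_grouped_by_year contributes more than one row per year to the join, which is the kind of fan-out that inflated nonzero_year_count.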

Koen answered Nov 24 '25 04:11



