I need to count distinct values that have been spread across more than one column. I have a method but suspect there is a more elegant way.
Example input:
year | project | staff 1 | staff 2 | staff 3 | cost |
---|---|---|---|---|---|
2022 | A | Ian | 100 | ||
2023 | A | Jim | Anne | 200 | |
2024 | A | Anne | Jim | Peter | 300 |
2023 | B | Anne | Sue | 400 | |
2024 | B | Dave | 500 |
Expected output:
project | number_distinct_staff | total_cost |
---|---|---|
A | 4 | 600 |
B | 3 | 900 |
Current method:
SELECT project
, COUNT(DISTINCT staff) AS num_distinct_staff
, SUM(cost) AS total_cost
FROM (
SELECT project, staff_1 AS staff, cost AS cost FROM table UNION ALL
SELECT project, staff_2 AS staff, NULL AS cost FROM table UNION ALL
SELECT project, staff_3 AS staff, NULL AS cost FROM table
) AS sub
GROUP BY project
This approach feels clumsy to me. I am worried that the asymmetric handling of columns (like cost
in the example) makes the process vulnerable to double counting.
Is there a better way?
Please try the following solution leveraging a Table Value Constructor.
For the refence: The Table Value Constructor (aka the VALUES keyword)
Basically, we are converting columns to rows via VALUES (...)
db-fiddle
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (project CHAR(1), staff_1 VARCHAR(20), staff_2 VARCHAR(20), staff_3 VARCHAR(20), cost INT);
INSERT @tbl (project, staff_1, staff_2, staff_3, cost) VALUES
('A', NULL, 'Ian', NULL, 100),
('A', 'Jim', 'Anne', NULL, 200),
('A', 'Anne', 'Jim', 'Peter', 300),
('B', 'Anne', 'Sue', NULL, 400),
('B', NULL, NULL, 'Dave', 500);
-- DDL and sample data population, end
SELECT t1.project
, COUNT(DISTINCT staff) AS num_distinct_staff
, SUM(t1.cost) AS total_cost
FROM @tbl AS t
CROSS APPLY (VALUES
(project, staff_1, cost),
(project, staff_2, NULL),
(project, staff_3, NULL)) AS t1(project, staff, cost)
GROUP BY t1.project;
Output
project | num_distinct_staff | total_cost |
---|---|---|
A | 4 | 600 |
B | 3 | 900 |
First off, your solution is correct you just had a little error,
see my dbFiddle
-- your solution after correction
SELECT project
, COUNT(DISTINCT staff) AS num_distinct_staff
, SUM(cost) AS total_cost
FROM (
SELECT project, staff_1 as staff, cost FROM tbl UNION ALL
SELECT project, staff_2, NULL AS staff FROM tbl UNION ALL
SELECT project, staff_3, NULL AS staff FROM tbl
) AS sub
GROUP BY project
If you want to try something other than your solution or the other answer, another way is to use both string_agg
and string_split
like this
select t.project,
count(distinct value) as number_distinct_staff,
min(t.costs) as total_cost
from ( select project,
sum(cost) as costs,
string_agg(staff_1, ',') + ',' + string_agg(staff_2, ',') + ',' + string_agg(staff_3, ',') as staffs
from tbl
group by project
) t
cross apply string_split(t.staffs, ',')
group by t.project
Here is the dbFiddle
And the result is
project | number_distinct_staff | total_cost |
---|---|---|
A | 4 | 600 |
B | 3 | 900 |
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With