I can't seem to find a simple answer for this. I am a beginner at SQL and I'm doing this in Amazon Athena. I would like to apply DISTINCT to one column only, but still return several other columns that are not distinct. This is my code:
SELECT DISTINCT line_item_resource_id
FROM table
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z';
I want the distinct to ONLY be on line_item_resource_id but return all of these:
line_item_resource_id, line_item_usage_start_date,
line_item_usage_end_date, line_item_usage_account_id,
line_item_availability_zone, line_item_product_code, product_instance_type,
pricing_term, product_operating_system, product_servicename,
line_item_line_item_type, line_item_usage_type, line_item_operation,
line_item_usage_amount
This code results in only line_item_resource_id coming out. How do I get the distinct only on that column but return the rest?
I want to suggest another solution: use ROW_NUMBER().
I'll show the basic version here; ROW_NUMBER() can do much more (for example, an ORDER BY inside the window to control which row is kept). Without an ORDER BY, which row in each group gets rn = 1 is arbitrary.
With this solution you do NOT need to wrap every column in an aggregate function; you can just use *. This makes the query much shorter and cleaner. Note that the result will also include the rn column.
So you can do:
WITH tmp_table AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY line_item_resource_id) rn
FROM table
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
)
SELECT *
FROM tmp_table
WHERE rn = 1
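As a sketch of the ORDER BY possibility mentioned above, the window can be ordered so that the kept row is predictable. Here I assume you want the earliest row per resource, ordered by line_item_usage_start_date; that choice is my assumption, not something the question states. The name table is the placeholder from the question, so substitute your real table name.

```sql
WITH tmp_table AS (
  SELECT
    *,
    -- Number the rows within each resource ID, earliest usage start first.
    ROW_NUMBER() OVER (
      PARTITION BY line_item_resource_id
      ORDER BY line_item_usage_start_date
    ) AS rn
  FROM table  -- placeholder table name, as in the question
  WHERE product_servicename = 'Amazon Elastic Compute Cloud'
    AND line_item_usage_account_id = '544934960'
    AND line_item_usage_type LIKE '%BoxUsage%'
    AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
)
SELECT *
FROM tmp_table
WHERE rn = 1  -- keep only the first row of each resource ID
```

If several rows for the same resource share the same start date, the pick among those ties is still arbitrary; adding more columns to the ORDER BY would narrow that down.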
Maryam's answer is correct, and here is a more detailed version, using the ARBITRARY function available in Athena, as well as SUM:
SELECT
line_item_resource_id,
MIN(line_item_usage_start_date) AS line_item_usage_start_date,
MAX(line_item_usage_end_date) AS line_item_usage_end_date,
ARBITRARY(line_item_usage_account_id) AS line_item_usage_account_id,
ARBITRARY(line_item_availability_zone) AS line_item_availability_zone,
ARBITRARY(line_item_product_code) AS line_item_product_code,
ARBITRARY(product_instance_type) AS product_instance_type,
ARBITRARY(pricing_term) AS pricing_term,
ARBITRARY(product_operating_system) AS product_operating_system,
ARBITRARY(product_servicename) AS product_servicename,
ARBITRARY(line_item_line_item_type) AS line_item_line_item_type,
ARBITRARY(line_item_usage_type) AS line_item_usage_type,
ARBITRARY(line_item_operation) AS line_item_operation,
SUM(line_item_usage_amount) AS line_item_usage_amount
FROM table
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
AND line_item_usage_account_id = '544934960'
AND line_item_usage_type LIKE '%BoxUsage%'
AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
GROUP BY line_item_resource_id
What's going on here is that grouping on line_item_resource_id makes each distinct resource ID end up as a single row in the result. But since each distinct value of that column appears on multiple rows in the data, we need to tell Athena how to flatten all those rows into a single row; otherwise Athena doesn't know how to produce the result you say you want.
The way this is done is through aggregate functions. These take multiple values and produce a single value. When the column is a number, it's common to want to sum the values of the group, and I've done that in my example above with the line_item_usage_amount column, since I know this data set and I know that's a column you want to sum.
For other columns that have string data, like pricing_term, how you flatten them depends on what you want. Most of the other columns will have only one value for the same resource ID, like pricing_term and product_servicename. In Athena there's a function called ARBITRARY which does what it says: it picks an arbitrary (non-null) value from the group. When all values are the same, it doesn't matter which one is picked. When there are multiple values but you don't care which one gets picked, this function is also the best to use.
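If you're not sure whether a column really has only one value per resource ID, a quick check along these lines can tell you before you reach for ARBITRARY. This is only a sketch: pricing_term is used as the example column, the alias distinct_pricing_terms is made up for illustration, and the filter is copied from the question.

```sql
SELECT
  line_item_resource_id,
  -- How many different non-null pricing terms this resource ID has.
  COUNT(DISTINCT pricing_term) AS distinct_pricing_terms
FROM table  -- placeholder table name, as in the question
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
  AND line_item_usage_account_id = '544934960'
  AND line_item_usage_type LIKE '%BoxUsage%'
  AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
GROUP BY line_item_resource_id
-- Only list resource IDs where the column takes more than one value.
HAVING COUNT(DISTINCT pricing_term) > 1
```

An empty result means every resource ID has at most one non-null pricing_term in the filtered data, so ARBITRARY loses nothing for that column.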
There are some situations where there could be multiple values for a column in a group and there is some order to them, for example line_item_usage_start_date and line_item_usage_end_date. In this case you can use MIN and MAX to get the first or last value.
In situations where there are multiple values and you want to pick one specific value, there are lots of aggregate functions to choose from, and you can make pretty complex choices.
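As one example of such a choice, Athena's MAX_BY and MIN_BY aggregate functions return the value of one column taken from the row where another column is largest or smallest. Below is a sketch, assuming you want the usage type from the latest line item and the operation from the earliest one for each resource. Which columns to order by is my assumption, and the aliases latest_usage_type and first_operation are made up for illustration.

```sql
SELECT
  line_item_resource_id,
  -- Usage type taken from the row with the latest usage end date.
  MAX_BY(line_item_usage_type, line_item_usage_end_date) AS latest_usage_type,
  -- Operation taken from the row with the earliest usage start date.
  MIN_BY(line_item_operation, line_item_usage_start_date) AS first_operation,
  SUM(line_item_usage_amount) AS line_item_usage_amount
FROM table  -- placeholder table name, as in the question
WHERE product_servicename = 'Amazon Elastic Compute Cloud'
  AND line_item_usage_account_id = '544934960'
  AND line_item_usage_type LIKE '%BoxUsage%'
  AND identity_time_interval = '2020-06-29T00:00:00Z/2020-06-30T00:00:00Z'
GROUP BY line_item_resource_id
```

Unlike ARBITRARY, this makes the pick depend on the data rather than on whichever row the engine happens to see first.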