Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Put many columns in group by clause in Oracle SQL

In Oracle 11g database, Suppose we have table, CUSTOMER and PAYMENT as follows

Customer

CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
--------------------------------------------------------------------
001                     John             30              1 Jan 2017
002                     Jack             10              2 Jan 2017
003                      Jim             50              3 Jan 2017

Payment

CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT | 
-------------------------------------------
001                   900            100.00
001                   901            200.00
001                   902            300.00
003                   903            999.00

We want to write an SQL to get all columns from table CUSTOMER together with the sum of all payment of each customer. There are many possible ways to do this but I would like to ask which one of the following is better.

Solution 1

SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;

Solution 2

SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE

Please notice in Solution 1 that I use MAX not because I actually want the max results, but I because I want "ONE" row from the columns which I know are equal for all rows with the same CUSTOMER_ID

While in solution 2, I avoid putting the misleading MAX in SELECT part by putting the columns in GROUP BY part instead.

With my current knowledge, I prefer Solution 1 because it is more important to comprehend the logic in GROUP BY part than in the SELECT part. I would put only a set of unique keys to express the intention of the query, so the application can infer the expected number of rows. But I don't know about the performance.

I ask this question because I am reviewing a code change of a big SQL that put 50 columns in the GROUP BY clause because the editor want avoid the MAX function in SELECT part. I know we can refactor the query in someway to avoid putting the irrelevant columns in both GROUP BY and SELECT part, but please discard that option because it will affect the application logic and require more time to do the test.


Update

I have just done the test on my big query in both versions as everyone suggested. The query is complex, it has 69 lines involving more than 20 tables and the execution plan is more than 190 lines, so I think this is not the place to show it.

My production data is quite small now, it has about 4000 customers and the query was run against the whole database. Only table CUSTOMER and a few reference table has TABLE ACCESS FULL in the execution plan, the others tables have access by indexes. The execution plans for both versions have a little bit difference in join algorithm (HASH GROUP BY vs SORT AGGREGATE) on some part.

Both versions use about 13 minutes, no significant difference.

I also have done the test on the simplified versions similar to the SQL in the question. Both version has exactly the same execution plan and elapse time.

With the current information, I think the most reasonable answer is that it is unpredictable unless test to decide the quality of both versions as the optimizer will do the job. I will very appreciate if anyone could give any information to convince or reject this idea.

like image 580
asinkxcoswt Avatar asked Jan 18 '26 10:01

asinkxcoswt


2 Answers

Another option is

SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
 SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
 FROM PAYMENT 
 GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)

To decide which one of three is better just test them and see the execution plans.

like image 195
Serg Avatar answered Jan 21 '26 00:01

Serg


Neither. Do the sum on payment, then join the results.

select C.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer C
left join -- I've used left in case you want to include customers with no orders
(
select customer_id, sum(payment_amount) as total_payment
from Payment
group by customer_id
) p
on p.customer_id = c.customer_id
like image 20
JohnHC Avatar answered Jan 21 '26 02:01

JohnHC