On occasion when building a query I know I will have duplicates due to one of the joined tables. Is it better practice to remove those duplicates using a more rigorous join statement and/or using SELECT DISTINCT in a subquery, rather than using a SELECT DISTINCT statement for the entire thing?
For example:
SELECT PL.STATION,
PL.PART_NUMBER,
PD.DESCRIPTION
FROM SAMPLE_DB.PART_LOCATOR PL
LEFT JOIN PROD_SAMPLE_DB.PART_DETAIL PD
ON PL.PART_NUMBER = PD.PART_NUMBER
WHERE PL.PART_NUMBER = '4077968'
AND PL.STATION = 'ATL'
4 rows returned.
Obviously I could just use a SELECT DISTINCT to get unique values:
SELECT DISTINCT PL.STATION,
PL.PART_NUMBER,
PD.DESCRIPTION
FROM SAMPLE_DB.PART_LOCATOR PL
LEFT JOIN PROD_SAMPLE_DB.PART_DETAIL PD
ON PL.PART_NUMBER = PD.PART_NUMBER
WHERE PL.PART_NUMBER = '4077968'
AND PL.STATION = 'ATL'
1 row returned.
But is this better form (making the join more specific)?
SELECT PL.STATION,
PL.PART_NUMBER,
PD.DESCRIPTION
FROM SAMPLE_DB.PART_LOCATOR PL
LEFT JOIN PROD_SAMPLE_DB.PART_DETAIL PD
ON PL.PART_NUMBER = PD.PART_NUMBER
AND PL.BIN_NUMBER = PD.BIN_NUMBER <--- Additional join criteria
WHERE PL.PART_NUMBER = '4077968'
AND PL.STATION = 'ATL'
1 row returned.
Or this:
SELECT PL.STATION,
PL.PART_NUMBER,
PD.DESCRIPTION
FROM SAMPLE_DB.PART_LOCATOR PL
LEFT JOIN
(
SELECT DISTINCT <--- Add distinct statement in subquery
PART_NUMBER,
DESCRIPTION
FROM
PROD_SAMPLE_DB.PART_DETAIL
) PD
ON PL.PART_NUMBER = PD.PART_NUMBER
WHERE PL.PART_NUMBER = '4077968'
AND PL.STATION = 'ATL'
1 row returned.
What is considered best practice?
There is a best practice. If this is the query that you intend, then use it!
SELECT PL.STATION, PL.PART_NUMBER, PD.DESCRIPTION
FROM SAMPLE_DB.PART_LOCATOR PL LEFT JOIN
PROD_SAMPLE_DB.PART_DETAIL PD
ON PL.PART_NUMBER = PD.PART_NUMBER AND
PL.BIN_NUMBER = PD.BIN_NUMBER <--- Additional join criteria
WHERE PL.PART_NUMBER = '4077968' AND PL.STATION = 'ATL';
As a general rule, SELECT DISTINCT incurs a fair amount of overhead for the query. Hence, you should avoid it or use it sparingly. The idea of generating duplicate rows using JOIN just to remove them with SELECT DISTINCT is rather reminiscent of Sisyphus pushing a rock up a hill, only to have it roll back down again.
That doesn't mean that you shouldn't use it when appropriate.
Conclusion: If adding one or more join conditions does what you want, then that is the best thing to do.
Otherwise, doing the DISTINCT before the JOIN is likely to be a good idea. However, there are exceptions to this rule. Doing the DISTINCT after the JOIN makes sense under some circumstances:
WHERE clause is highly selective (meaning that removing the unique elements in a subquery wastes a lot of effort on rows that are not in the final result set).I don't know if there is necessarily a best practice in this case. Typically, you always want your join semantics to match the logical representation of your data and the desired result. In other words, if logically the join criteria should be across multiple columns because this is how part descriptions logically relate to a part, then that is what you should do.
Otherwise, I always default to what is more readable from a query perspective and how the query will perform. Placing the DISTINCT inside a sub-query results in a somewhat bloated query for others to try and understand. Also, the DBMS might even perform worse from a performance perspective with the sub-query approach since it could cause indexes to be eliminated).
Obviously, I don't understand your data model, but each of these queries could actually produce different results depending on how a part description is logically associated to a part.
It appears that a part description is associated to a part based on both a part number and bin number (this fact is based on the structure of the PART_DETAIL table). However, in this specific case, all part numbers (but with different bin numbers) happen to have the same part description.
What if the descriptions were not the same (e.g. different for each part number/bin number combination). Then only your more specific join would return the correct results (e.g. the part description for a specific part). So again, it all goes back to writing your query such that the logic used matches the logical representation of your data and the result set you are looking for.
This is just my 2 cents.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With