I have a Person table with huge number of records(for about 16 million), and have a requirement to find all persons, with same lastname, first letter of firstname and birthyear, in other worlds I want to show assuming duplicate persons in UI for users to analyze and decide are there a same person or not.
Here is the query I write
SELECT * 
FROM Person INNER JOIN
(
    SELECT SUBSTRING(firstName, 1, 1) firstNameF,lastName,YEAR(birthDate) birthYear
    FROM Person
    GROUP BY SUBSTRING(firstName, 1,1),lastName,YEAR(birthDate)
    HAVING count(*) > 1
) as dupPersons 
ON SUBSTRING(Person.firstName,1,1) = dupPersons.firstNameF and Person.lastName = dupPersons.lastName and YEAR(Person.birthDate) = dupPersons.birthYear
order by Person.lastName,Person.firstName
but as I am not SQL expert, want too know, is this good way to do that? are there more optimized way?
EDIT
Note that I can cut data, which can have contribution in optimization
for example if I want to cut data by 2 it could return two persons
Johan Smith |
Jane Smith  | have same lastname and first name inita 
Jack Smith  |
Mark Tween  | have same lastname and first name inita 
Mac Tween   |
If the performance using a GROUP BY is not adequate, You could try using an INNER JOIN
SELECT  *
FROM    Person p1
        INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE   SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1) 
        AND p2.LastName = p1.LastName
        AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
        p1.LastName, p1.FirstName        
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent.  When we look at whether a query is "optimized", there are two immediate parts to that:  1. The query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, supersitions about NOT IN, etc.  2. The context that the query operates within - DB specifics, task specifics, etc.
Your query passes #1, no problem.  I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery.  But these aren't optimization issues.  Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads.  Out of any further context, your query doesn't look like a bomb waiting to go off.
As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.
Things I'd want to know:
Example scenario:  If this was a run-on-command feature, it will be in my app indefinitely, it will get run weekly, with 10% or fewer records expected to be duplicates, with ability to change the DB how I'd like, if the duplicate matching criteria is firm (not fluctuating), and I wan to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate), and an index on LastName ASC, BirthYear ASC, FirstName ASC.  If too many of those stipulations change, I might to a different direction entirely.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With