How do I write the SQL (SQLite3, postgres, or MySQL) to combine (de-duplicate) Table A based on Table B data?
Given People (table: contacts) that have Phone Numbers (table: phones), I want to use phone numbers to remove duplicate contacts.
In this example there the company is there for clarification, the real data may not have the same company listed even though the person is the same based on phone number.
Note, in this example, there are two distinct Betty's, but the two Charles are the same person and should be combined.
Also, Ashok and Dale have a common phone number because they work for the same company, but are not the same person.
Table: contacts
===============
id name company
1 Ashok Alpha Co.
2 Betty Beta Inc.
3 Charles Cain LLC. <---|
4 Betty Delta Corp. |-- same person
5 Charles Cain LLC. <---|
6 Dale Alpha Co.
Table: phones
============
id phone_number contact_id
1 (111) 123-1111 1
2 (111) 123-2222 1
3 (111) 123-3333 1
4 (111) 123-4444 1
5 (222) 456-1111 2
6 (222) 456-2222 2
7 (333) 789-1111 3
8 (333) 789-2222 3
9 (333) 789-3333 3
10 (333) 789-4444 3
11 (444) 456-7777 4
12 (444) 456-8888 4
13 (555) 789-5555 5
14 (333) 789-2222 5
15 (111) 123-3333 6
Joining the tables gives:
SELECT c.id, c.name, c.company, p.phone_number, p.contact_id
FROM contacts as c
INNER JOIN phones as p
ON c.id = p.contact_id
c.id c.name c.company p.phone p.contact_id
1 Ashok Alpha Co. (111) 123-1111 1
1 Ashok Alpha Co. (111) 123-2222 1
1 Ashok Alpha Co. (111) 123-3333 1
1 Ashok Alpha Co. (111) 123-4444 1
2 Betty Beta Inc. (222) 456-1111 2
2 Betty Beta Inc. (222) 456-2222 2
3 Charles Cain LLC. (333) 789-1111 3
3 Charles Cain LLC. (333) 789-2222 3
3 Charles Cain LLC. (333) 789-3333 3
3 Charles Cain LLC. (333) 789-4444 3
4 Betty Delta Corp. (444) 456-7777 4
4 Betty Delta Corp. (444) 456-8888 4
5 Charles Cain LLC. (555) 789-5555 5
5 Charles Cain LLC. (333) 789-2222 5
6 Dale Alpha Co. (111) 123-3333 6
So what I am thinking is that I would want to loop through all of the distint phone numbers, get all of the contacts for each of those numbers, check to see if the names are the same, if they are remove the duplicate contact and change the contact_ids on the phone numbers.
so the result would look like:
Table: contacts
===============
id name company
1 Ashok Alpha Co.
2 Betty Beta Inc.
3 Charles Cain LLC.
4 Betty Delta Corp. <-- Note the duplicate Charles (5) is removed
6 Dale Alpha Co.
Table: phones
============
id phone_number contact_id
1 (111) 123-1111 1
2 (111) 123-2222 1
3 (111) 123-3333 1
4 (111) 123-4444 1
5 (222) 456-1111 2
6 (222) 456-2222 2
7 (333) 789-1111 3
8 (333) 789-2222 3
9 (333) 789-3333 3
10 (333) 789-4444 3
11 (444) 456-7777 4
12 (444) 456-8888 4
13 (555) 789-5555 3 <-- Note the contact_id is updated
15 (111) 123-3333 6 <-- Note the duplicate phone number (14) is removed
or
c.id c.name c.company p.phone p.contact_id
1 Ashok Alpha Co. (111) 123-1111 1
1 Ashok Alpha Co. (111) 123-2222 1
1 Ashok Alpha Co. (111) 123-3333 1
1 Ashok Alpha Co. (111) 123-4444 1
2 Betty Beta Inc. (222) 456-1111 2
2 Betty Beta Inc. (222) 456-2222 2
3 Charles Cain LLC. (333) 789-1111 3
3 Charles Cain LLC. (333) 789-2222 3
3 Charles Cain LLC. (333) 789-3333 3
3 Charles Cain LLC. (333) 789-4444 3
4 Betty Delta Corp. (444) 456-7777 4
4 Betty Delta Corp. (444) 456-8888 4
3 Charles Cain LLC. (555) 789-5555 3
6 Dale Alpha Co. (111) 123-3333 6
The following makes an assumption that your problem is as simple as you say it is. In other words, it is only looking for pairs of contacts that are the same, and not traversing a graph that could be much more complicated.
If you are saying that any two contacts with the same phone number (no matter how many they have) and the same name are the same, then you can find them using:
with cp as (
select c.*, p.phone_number
from contacts c join
phones p
on c.id = p.contact_id
)
select distinct cp.id as id1, cp2.id as id2
from cp join
cp cp2
on cp.phone_number = cp2.phone_number and cp.name = cp2.name and
cp.id <> cp2.id;
Presumably, you want to keep the first contact. So, let's use aggregation instead. In this
select min(cp.id) as id1, cp2.id as id2
from cp join
cp cp2
on cp.phone_number = cp2.phone_number and cp.name = cp2.name and
cp.id < cp2.id
group by cp2.id;
This produces pairs of contact ids. We want to keep the first and delete the second.
Now, if we make the assumption that duplicates are only one deep, then we can incorporate this into a delete:
with cp as (
select c.*, p.phone_number
from contacts c join
phones p
on c.id = p.contact_id
)
delete from contacts
where id in (select cp2.id
from cp join
cp cp2
on cp.phone_number = cp2.phone_number and cp.name = cp2.name and
cp.id < cp2.id
);
(The group by is actually unnecessary for the in.).
Note: This does not work in MySQL, where the equivalent logic would need to be expressed using JOIN and CTEs are not supported.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With