Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL to combine (de-duplicate) Table A based on Table B data?

How do I write the SQL (SQLite3, postgres, or MySQL) to combine (de-duplicate) Table A based on Table B data?

Given People (table: contacts) that have Phone Numbers (table: phones), I want to use phone numbers to remove duplicate contacts.

In this example there the company is there for clarification, the real data may not have the same company listed even though the person is the same based on phone number.

Note, in this example, there are two distinct Betty's, but the two Charles are the same person and should be combined.

Also, Ashok and Dale have a common phone number because they work for the same company, but are not the same person.

Table: contacts
===============
id      name      company
1       Ashok     Alpha Co.
2       Betty     Beta Inc.
3       Charles   Cain LLC.          <---|
4       Betty     Delta Corp.            |-- same person
5       Charles   Cain LLC.          <---|
6       Dale      Alpha Co.

Table: phones
============
id      phone_number    contact_id
1       (111) 123-1111  1
2       (111) 123-2222  1
3       (111) 123-3333  1
4       (111) 123-4444  1
5       (222) 456-1111  2
6       (222) 456-2222  2
7       (333) 789-1111  3
8       (333) 789-2222  3
9       (333) 789-3333  3
10      (333) 789-4444  3
11      (444) 456-7777  4
12      (444) 456-8888  4
13      (555) 789-5555  5
14      (333) 789-2222  5
15      (111) 123-3333  6

Joining the tables gives:

SELECT      c.id, c.name, c.company, p.phone_number, p.contact_id
FROM        contacts as c
INNER JOIN  phones as p
ON          c.id = p.contact_id

c.id    c.name      c.company       p.phone         p.contact_id
1       Ashok       Alpha Co.       (111) 123-1111  1
1       Ashok       Alpha Co.       (111) 123-2222  1
1       Ashok       Alpha Co.       (111) 123-3333  1
1       Ashok       Alpha Co.       (111) 123-4444  1
2       Betty       Beta Inc.       (222) 456-1111  2
2       Betty       Beta Inc.       (222) 456-2222  2
3       Charles     Cain LLC.       (333) 789-1111  3
3       Charles     Cain LLC.       (333) 789-2222  3
3       Charles     Cain LLC.       (333) 789-3333  3
3       Charles     Cain LLC.       (333) 789-4444  3
4       Betty       Delta Corp.     (444) 456-7777  4
4       Betty       Delta Corp.     (444) 456-8888  4
5       Charles     Cain LLC.       (555) 789-5555  5
5       Charles     Cain LLC.       (333) 789-2222  5
6       Dale        Alpha Co.       (111) 123-3333  6

So what I am thinking is that I would want to loop through all of the distint phone numbers, get all of the contacts for each of those numbers, check to see if the names are the same, if they are remove the duplicate contact and change the contact_ids on the phone numbers.

so the result would look like:

Table: contacts
===============
id      name      company
1       Ashok     Alpha Co.
2       Betty     Beta Inc.
3       Charles   Cain LLC.
4       Betty     Delta Corp.       <-- Note the duplicate Charles (5) is removed
6       Dale      Alpha Co.

Table: phones
============
id      phone_number    contact_id
1       (111) 123-1111  1
2       (111) 123-2222  1
3       (111) 123-3333  1
4       (111) 123-4444  1
5       (222) 456-1111  2
6       (222) 456-2222  2
7       (333) 789-1111  3
8       (333) 789-2222  3
9       (333) 789-3333  3
10      (333) 789-4444  3
11      (444) 456-7777  4
12      (444) 456-8888  4
13      (555) 789-5555  3         <-- Note the contact_id is updated
15      (111) 123-3333  6         <-- Note the duplicate phone number (14) is removed

or

c.id    c.name      c.company       p.phone         p.contact_id
1       Ashok       Alpha Co.       (111) 123-1111  1
1       Ashok       Alpha Co.       (111) 123-2222  1
1       Ashok       Alpha Co.       (111) 123-3333  1
1       Ashok       Alpha Co.       (111) 123-4444  1
2       Betty       Beta Inc.       (222) 456-1111  2
2       Betty       Beta Inc.       (222) 456-2222  2
3       Charles     Cain LLC.       (333) 789-1111  3
3       Charles     Cain LLC.       (333) 789-2222  3
3       Charles     Cain LLC.       (333) 789-3333  3
3       Charles     Cain LLC.       (333) 789-4444  3
4       Betty       Delta Corp.     (444) 456-7777  4
4       Betty       Delta Corp.     (444) 456-8888  4
3       Charles     Cain LLC.       (555) 789-5555  3
6       Dale        Alpha Co.       (111) 123-3333  6
like image 835
MERM Avatar asked Jan 24 '26 21:01

MERM


1 Answers

The following makes an assumption that your problem is as simple as you say it is. In other words, it is only looking for pairs of contacts that are the same, and not traversing a graph that could be much more complicated.

If you are saying that any two contacts with the same phone number (no matter how many they have) and the same name are the same, then you can find them using:

with cp as (
      select c.*, p.phone_number
      from contacts c join
           phones p
           on c.id = p.contact_id
     )
select distinct cp.id as id1, cp2.id as id2
from cp join
     cp cp2
     on cp.phone_number = cp2.phone_number and cp.name = cp2.name and
        cp.id <> cp2.id;

Presumably, you want to keep the first contact. So, let's use aggregation instead. In this

select min(cp.id) as id1, cp2.id as id2
from cp join
     cp cp2
     on cp.phone_number = cp2.phone_number and cp.name = cp2.name and
         cp.id < cp2.id
group by cp2.id;

This produces pairs of contact ids. We want to keep the first and delete the second.

Now, if we make the assumption that duplicates are only one deep, then we can incorporate this into a delete:

with cp as (
      select c.*, p.phone_number
      from contacts c join
           phones p
           on c.id = p.contact_id
     )
delete from contacts
    where id in (select cp2.id
                 from cp join
                      cp cp2
                      on cp.phone_number = cp2.phone_number and cp.name = cp2.name and
                         cp.id < cp2.id
                );

(The group by is actually unnecessary for the in.).

Note: This does not work in MySQL, where the equivalent logic would need to be expressed using JOIN and CTEs are not supported.

like image 101
Gordon Linoff Avatar answered Jan 26 '26 13:01

Gordon Linoff