I have read many threads on this subject now and tried a few things but it has not worked as I hoped. I need some clarification and apologize if this is considered a duplicate thread.
A client of mine hosts a Postgres database where one table holds a little more then 12 million records. They have tasked me with finding duplicate records, extract them for viewing and if everything looks ok, delete the duplicates.
My main concern has been performance on the server. Running DISTINCT queries on 12 million records must consume a lot of resources?
Since my first task is to extract the records for viewing in, say a CSV, and not simply deleting them my approach in PgAdmin was executing this to a file.
SELECT * 
FROM
    my_table
WHERE
my_table_id NOT IN (
                SELECT DISTINCT 
                    ON (
                        num_1,
                        num_2,
                        num_3,
                        num_4,
                        num_5,
                        my_date
                    )
                    my_table_id
                FROM
                    my_table
);
However this query takes way to long. After 20 minutes of execution time I halted the execution. To make things more complex my client is reluctant to allow me to clone a local copy of the table because of strict security. They prefer it is all done on the live hosting environment.
The table definition is quite simple. It looks like this
CREATE TABLE my_table
(
    my_table_id bigserial NOT NULL,
    num_1 bigserial NOT NULL,
    num_2 bigserial NOT NULL,
    num_3 bigserial NOT NULL,
    num_4 numeric,
    num_5 integer,
    my_date date,
    my_text character varying
)
The primary key "my_table_id" has not been compromised and is always unique. The col "my_text" is not interesting in the query since it will be empty for all duplicates. It is only the numeric fields and the date that needs matching. All columns (except my_table_id and my_text) must match across records to qualify as a duplicate.
What is the best way to solve this? Is there a server-friendly way that won´t eat all resources on the host environment? Please help me understand the best approach!
Thanks you!
Need to use GROUP BY and HAVING to get duplicate records instead of DISTINCT
subquery will find all duplicate records
SELECT * FROM
my_table mt
JOIN
(
     SELECT
            num_1,
            num_2,
            num_3,
            num_4,
            num_5,
            my_date
     FROM
            my_table
     GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
     HAVING COUNT(*) >1
) T 
ON mt.num_1= T.num_1
and mt.num_2= T.num_2
and mt.num_3= T.num_3
and mt.num_4= T.num_4
and mt.num_5= T.num_5
and mt.my_date= T.my_date
Another way using analytic functions
select * from (
    select * , 
    count(*) over (partition by num1,num2,num3,num4,my_date) cnt
    from mytable
) t1 where cnt > 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With