Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove Duplicate rows from a large table - PostgreSQL

I want to remove duplicates from a large table having about 1million rows and increasing every hour. It has no unique id and has about ~575 columns but sparsely filled.

The table is 'like' a log table where new entries are appended every hour without unique timestamp.

The duplicates are like 1-3% but I want to remove it anyway ;) Any ideas?

I tried ctid column (as here) but its very slow.

like image 559
UserBSS1 Avatar asked Oct 20 '25 19:10

UserBSS1


1 Answers

The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.

Example:

CREATE INDEX index_name ON tablename (md5((tablename.*)::text));

This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).

Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:

SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
 ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;

You may also use this index to avoid duplicates rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would be rejected at INSERT or UPDATE time).

like image 196
Daniel Vérité Avatar answered Oct 23 '25 09:10

Daniel Vérité