I want to remove duplicates from a large table with about 1 million rows, growing every hour. It has no unique id and about ~575 columns, sparsely filled.
The table is 'like' a log table: new entries are appended every hour, without a unique timestamp.
The duplicates are only about 1-3%, but I want to remove them anyway ;) Any ideas?
I tried the ctid column (as here), but it's very slow.
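(For reference, a typical ctid-based de-duplication along those lines looks roughly like the sketch below; the column names are placeholders for the full column list, and grouping over all ~575 columns with no supporting index is presumably what makes it so slow.)
DELETE FROM tablename
WHERE ctid NOT IN (
    SELECT min(ctid)
    FROM tablename
    GROUP BY col_a, col_b  -- placeholder: in practice, every column of the table
);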
The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.
Example:
CREATE INDEX index_name ON tablename (md5((tablename.*)::text));
This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).
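To illustrate the caveat, the text form of a timestamp with time zone changes with the session's TimeZone setting (hypothetical value shown), so hashing it does not give a stable result:
SET timezone = 'UTC';
SELECT '2024-01-01 12:00:00+00'::timestamptz::text;  -- 2024-01-01 12:00:00+00
SET timezone = 'America/New_York';
SELECT '2024-01-01 12:00:00+00'::timestamptz::text;  -- 2024-01-01 07:00:00-05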
Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:
SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
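Building on that query, the actual removal could look like this sketch (untested against your table; it keeps the copy with the lowest ctid in each group of duplicates, and the extra whole-row comparison guards against the unlikely case of an md5 collision):
DELETE FROM tablename t1
USING tablename t2
WHERE md5((t1.*)::text) = md5((t2.*)::text)
  AND (t1.*)::text = (t2.*)::text
  AND t1.ctid > t2.ctid;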
You may also use this index to avoid duplicate rows in the future, rather than periodically de-duplicating them, by making it UNIQUE: duplicate rows would then be rejected at INSERT or UPDATE time.
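For instance (the index name here is just an example), once the existing duplicates are gone:
CREATE UNIQUE INDEX tablename_row_md5_uniq
    ON tablename (md5((tablename.*)::text));
Note that creating it before de-duplicating would fail, since the existing duplicate rows violate the constraint.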