Coder Perfect

How can you get rid of redundant entries?

Problem

I need to add a unique constraint to a table that already exists. The trouble is that the table already contains millions of records, many of which violate the unique constraint I need to add.

What is the quickest way to get rid of the offending rows? I have a SQL statement that finds and deletes duplicates, but it takes forever to run. Is there another way to approach this? Perhaps backing up the table, then restoring it after the constraint has been added?

Asked by gjrwebber

Solution #1

Some of these approaches seem a little complicated; I generally do it like this:

Say you want to make a table unique on (field1, field2), keeping the row with the maximum field3:

DELETE FROM table USING table alias
  WHERE table.field1 = alias.field1
    AND table.field2 = alias.field2
    AND table.field3 < alias.field3;

For example: I have a table user_accounts and I want to add a unique constraint on email, but there are some duplicates. I also want to keep the most recently created one (the max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2
  WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
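
Once the duplicates are gone, the constraint from the question can be added. A minimal sketch; the constraint name is an assumption, not part of the original answer:

```sql
-- Add the unique constraint the question asks for.
-- The name user_accounts_email_key is an assumed example.
ALTER TABLE user_accounts
  ADD CONSTRAINT user_accounts_email_key UNIQUE (email);
```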

Answered by Tim

Solution #2

For instance, you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
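
Note that SELECT DISTINCT * only collapses rows that are identical in every column. For the asker's case (keep the row with the max id per email), a DISTINCT ON variant of the same pattern works; this is my variant, not part of the original answer:

```sql
-- DISTINCT ON keeps exactly one row per email;
-- ORDER BY email, id DESC makes that the row with the highest id.
CREATE TABLE tmp AS
SELECT DISTINCT ON (email) *
FROM   user_accounts
ORDER  BY email, id DESC;

DROP TABLE user_accounts;
ALTER TABLE tmp RENAME TO user_accounts;
```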

Answered by just somebody

Solution #3

Instead of creating a new table, you can also truncate the existing table and re-insert the unique rows into it. All of this can be done in a single transaction.

This method is only useful when there are a large number of rows to remove from the table. Use a simple DELETE for a few duplicates.

You mentioned that there are millions of rows. To make the operation as fast as possible, allocate enough temporary buffers for the session. The setting must be adjusted before any temp buffer is used in the current session. Find out the size of your table:

SELECT pg_size_pretty(pg_relation_size('tbl'));

Set temp_buffers to a value at least a little higher than that:

SET temp_buffers = '200MB';   -- example value; units require quotes

BEGIN;

CREATE TEMP TABLE t_tmp AS  -- retains temp for duration of session
SELECT DISTINCT * FROM tbl  -- DISTINCT folds duplicates
ORDER  BY id;               -- optionally "cluster" data

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM t_tmp;        -- retains order (implementation detail)

COMMIT;

This method is preferable to creating a new table if dependent objects exist: views, indexes, foreign keys, or other objects that reference the table. With big tables, TRUNCATE is much faster than DELETE FROM tbl because it starts with a clean slate (a new file in the background); with small tables, DELETE can actually be faster.

For big tables it is often faster still to drop indexes and foreign keys (FK), refill the table, and recreate those objects. With FK constraints you must be certain the new data is valid, or you will run into exceptions when recreating the FK.

Unlike DELETE, TRUNCATE requires a more aggressive lock. This can be an issue for tables with heavy concurrent load, but it is still less disruptive than dropping and replacing the table entirely.

There is a similar strategy with a data-modifying CTE (Postgres 9.1+) if TRUNCATE is not an option or for small to medium tables in general:

WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del
ORDER  BY id; -- optionally "cluster" data while being at it
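
Either way, before adding the unique constraint you can verify that no duplicates remain. A sketch using the user_accounts/email example from the question (the table and column names are assumptions):

```sql
-- Returns one row per email value that still occurs more than once;
-- an empty result means the unique constraint can be added safely.
SELECT email, count(*)
FROM   user_accounts
GROUP  BY email
HAVING count(*) > 1;
```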

It is slower than the TRUNCATE version where TRUNCATE pays off, i.e. on big tables. But for small tables it may be faster (and simpler!).

If you have no dependent objects at all, you could create a new table and drop the old one, but you would hardly gain anything over this general approach.

For very big tables that will not fit into available RAM, creating a new table will be considerably faster. You will have to weigh that against possible complications / overhead with dependent objects.

Answered by Erwin Brandstetter

Solution #4

You can use ctid (or, in Postgres versions before 12, oid), which are normally “invisible” system columns:

DELETE FROM table
 WHERE ctid NOT IN
   (SELECT max(s.ctid)
      FROM table s
     GROUP BY s.column_that_should_be_distinct);
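
NOT IN over millions of ctids can be slow. An equivalent formulation with EXISTS often plans better; this is my variant, not part of the original answer, and "col" is a hypothetical column name:

```sql
-- Deletes every row for which another row with the same col value
-- and a larger ctid exists, i.e. keeps the row with the largest ctid
-- per group -- the same rows the NOT IN version keeps.
DELETE FROM table t
 WHERE EXISTS (
   SELECT 1
     FROM table s
    WHERE s.col = t.col
      AND s.ctid > t.ctid
 );
```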

Answered by Jan Marek

Solution #5

This is where PostgreSQL's window functions come in handy.

DELETE FROM tablename
WHERE id IN (
  SELECT id
  FROM (
    SELECT id,
           row_number() OVER (PARTITION BY column1, column2, column3
                              ORDER BY id) AS rnum
    FROM tablename
  ) t
  WHERE t.rnum > 1
);
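
The query above keeps the row with the smallest id in each group. The question asks to keep the most recently created row (the max id), which only requires flipping the sort direction; this is a small variant of my own, not part of the original answer:

```sql
-- Keeps the row with the LARGEST id per (column1, column2, column3) group:
-- row_number() = 1 is now assigned to the highest id, so it survives.
DELETE FROM tablename
WHERE id IN (
  SELECT id
  FROM (
    SELECT id,
           row_number() OVER (PARTITION BY column1, column2, column3
                              ORDER BY id DESC) AS rnum
    FROM tablename
  ) t
  WHERE t.rnum > 1
);
```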

See Deleting duplicates.

Answered by shekwi

Post is based on https://stackoverflow.com/questions/1746213/how-to-delete-duplicate-entries