Coder Perfect

Deleting duplicate rows from a SQLite database

Problem

In SQLite3, I have a very large table with 36 million rows. The table has two columns, hash and d.

Some of the rows are duplicates. That is, both hash and d have the same values. If two hashes are identical, then so are the values of d. Two identical d’s, on the other hand, do not imply two identical hashes.

I’d like to delete the duplicate rows. The table doesn’t have a primary key column.

What’s the quickest way to accomplish this?

Asked by Patches

Solution #1

You need a way to tell the rows apart. Based on your comment, you could use the special rowid column for that.

To remove the duplicates, keep the lowest rowid per (hash, d) pair and delete everything else:

delete   from YourTable
where    rowid not in
         (
         select  min(rowid)
         from    YourTable
         group by
                 hash
         ,       d
         )
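
As a minimal sketch (not part of the original answer), the same statement can be run from Python's standard sqlite3 module. The table name YourTable comes from the answer; the column types TEXT and REAL, the helper name delete_duplicates_by_rowid, and the sample rows are assumptions made for illustration.

```python
import sqlite3


def delete_duplicates_by_rowid(conn):
    """Keep the lowest rowid for each (hash, d) pair and delete the rest."""
    # "with conn" commits on success and rolls back if the statement fails.
    with conn:
        cur = conn.execute(
            """
            DELETE FROM YourTable
            WHERE rowid NOT IN (
                SELECT MIN(rowid) FROM YourTable GROUP BY hash, d
            )
            """
        )
    return cur.rowcount  # number of rows deleted


# Hypothetical sample data; the column types are assumed, not given in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 1.0), ("C", 2.0), ("C", 2.0)],
)
removed = delete_duplicates_by_rowid(conn)
rows = conn.execute("SELECT hash, d FROM YourTable ORDER BY hash, d").fetchall()
print(removed, rows)
```

This relies on the table being an ordinary rowid table; a table created WITHOUT ROWID has no such column.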

Answered by Andomar

Solution #2

I suppose the quickest way would be to do it within the same database: create a new table with the same columns as the original but with proper constraints (a unique index on the hash/d pair?), then iterate through the original table and try to insert each record into the new table, ignoring constraint violation errors (i.e. keep iterating when an exception is raised).

Then drop the old table and rename the new one to the old name.
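
One way this could look in SQLite, as a sketch under assumptions: instead of iterating in application code and catching exceptions, it uses SQLite's INSERT OR IGNORE conflict clause, which skips rows that would violate the unique constraint. The names YourTable, YourTable_new and dedupe_via_new_table, and the column types TEXT and REAL, are hypothetical.

```python
import sqlite3


def dedupe_via_new_table(conn):
    """Copy rows into a table with a UNIQUE constraint, then swap it in."""
    # executescript() commits any open transaction before running the script,
    # so the script manages its own transaction with BEGIN/COMMIT.
    conn.executescript(
        """
        BEGIN;
        CREATE TABLE YourTable_new (hash TEXT, d REAL, UNIQUE (hash, d));
        -- OR IGNORE silently skips rows that violate the UNIQUE constraint.
        INSERT OR IGNORE INTO YourTable_new (hash, d)
            SELECT hash, d FROM YourTable;
        DROP TABLE YourTable;
        ALTER TABLE YourTable_new RENAME TO YourTable;
        COMMIT;
        """
    )


# Hypothetical sample data; the column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 1.0), ("C", 2.0), ("C", 2.0)],
)
dedupe_via_new_table(conn)
rows = conn.execute("SELECT hash, d FROM YourTable ORDER BY hash, d").fetchall()
print(rows)
```

A side effect of this design is that the unique constraint stays on the table afterwards, so later inserts of an existing (hash, d) pair fail instead of creating new duplicates.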

Answered by MaDa

Solution #3

If adding a primary key isn’t an option, one approach is to select the duplicated rows DISTINCT into a temp table, delete all of the duplicated records from the original table, and then re-insert the records from the temp table into the original table.

For instance (written for SQL Server 2008, but the method applies to any database):

DECLARE @original AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('A', 2)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('B', 1)
INSERT INTO @original VALUES('C', 1)
INSERT INTO @original VALUES('C', 1)

DECLARE @temp AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @temp
SELECT [hash], [d] FROM @original 
GROUP BY [hash], [d]
HAVING COUNT(*) > 1

DELETE O
FROM @original O
JOIN @temp T ON T.[hash] = O.[hash] AND T.[d] = O.[d]

INSERT INTO @original
SELECT [hash], [d] FROM @temp

SELECT * FROM @original
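
The same three steps might be translated to SQLite roughly as follows. This is a sketch, not the answerer's code: SQLite has no table variables and no DELETE ... JOIN, so it uses a TEMP table and an EXISTS subquery instead. The names YourTable, dup and dedupe_via_temp_table, and the column types, are assumptions.

```python
import sqlite3


def dedupe_via_temp_table(conn):
    """Stash one copy of each duplicated row, delete all copies, re-insert."""
    with conn:  # commit on success, roll back on error
        # Step 1: one row per (hash, d) pair that occurs more than once.
        conn.execute(
            "CREATE TEMP TABLE dup AS "
            "SELECT hash, d FROM YourTable GROUP BY hash, d HAVING COUNT(*) > 1"
        )
        # Step 2: delete every copy of those pairs from the original table.
        conn.execute(
            "DELETE FROM YourTable WHERE EXISTS ("
            "SELECT 1 FROM dup "
            "WHERE dup.hash = YourTable.hash AND dup.d = YourTable.d)"
        )
        # Step 3: put a single copy of each pair back.
        conn.execute("INSERT INTO YourTable (hash, d) SELECT hash, d FROM dup")
        conn.execute("DROP TABLE dup")


# Hypothetical sample data; the column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 1.0), ("C", 2.0), ("C", 2.0)],
)
dedupe_via_temp_table(conn)
rows = conn.execute("SELECT hash, d FROM YourTable ORDER BY hash, d").fetchall()
print(rows)
```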

I’m not sure if sqlite has a ROW_NUMBER() type function, but if it does you could also try some of the approaches listed here: Delete duplicate records from a SQL table without a primary key
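
For what it's worth, SQLite has supported window functions, including ROW_NUMBER(), since version 3.25.0. A rough sketch of that approach follows, assuming a rowid table named YourTable with columns hash and d; the helper name dedupe_with_row_number and the sample rows are made up for illustration.

```python
import sqlite3


def dedupe_with_row_number(conn):
    """Number the rows within each (hash, d) group and delete all but the first."""
    if sqlite3.sqlite_version_info < (3, 25, 0):
        raise RuntimeError("window functions need SQLite 3.25.0 or newer")
    with conn:  # commit on success, roll back on error
        conn.execute(
            """
            DELETE FROM YourTable
            WHERE rowid IN (
                SELECT rid FROM (
                    SELECT rowid AS rid,
                           ROW_NUMBER() OVER (
                               PARTITION BY hash, d ORDER BY rowid
                           ) AS rn
                    FROM YourTable
                )
                WHERE rn > 1
            )
            """
        )


# Hypothetical sample data; the column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 1.0), ("C", 2.0), ("C", 2.0)],
)
dedupe_with_row_number(conn)
rows = conn.execute("SELECT hash, d FROM YourTable ORDER BY hash, d").fetchall()
print(rows)
```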

Answered by rsbarro

Post is based on https://stackoverflow.com/questions/8190541/deleting-duplicate-rows-from-sqlite-database