In SQLite3, I have a large table with 36 million rows. The table has two columns, hash and d.
Some of the rows are duplicates: the values of both hash and d are the same. If two hashes are identical, their d values are identical too. Two identical d values, on the other hand, do not imply two identical hashes.
I'd like to get rid of the duplicate rows. The table has no primary key column.
What’s the quickest way to accomplish this?
Asked by Patches
You'll need a way to tell the rows apart. Based on your statement, you could use SQLite's special rowid column for that.
By keeping the lowest rowid per (hash, d), the duplicates can be removed:
    delete from YourTable
    where rowid not in (
        select min(rowid)
        from YourTable
        group by hash, d
    )
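For anyone who wants to try this on a small scale first, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The column types (TEXT and REAL), the sample rows, and the helper name dedupe_by_rowid are assumptions made up for illustration; only the DELETE statement comes from the answer.

```python
import sqlite3

def dedupe_by_rowid(conn, table="YourTable"):
    """Delete every row except the lowest-rowid row of each (hash, d) group."""
    # The table name is interpolated directly, so pass only trusted identifiers.
    cur = conn.execute(
        f"DELETE FROM {table} "
        f"WHERE rowid NOT IN (SELECT MIN(rowid) FROM {table} GROUP BY hash, d)"
    )
    conn.commit()
    return cur.rowcount  # number of rows deleted

# Illustrative data only: column types and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 2.0), ("C", 1.0), ("C", 1.0), ("C", 1.0)],
)

deleted = dedupe_by_rowid(conn)
remaining = conn.execute(
    "SELECT hash, d FROM YourTable ORDER BY hash, d"
).fetchall()
print(deleted, remaining)
```

On 36 million rows the subquery has to group the whole table, so it is probably worth timing this on a copy of the database before running it for real.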
Answered by Andomar
I suppose the quickest way would be to do it inside the database itself: create a new table with the same columns as the original table, but with proper constraints (a unique index on the hash/real pair?), iterate through the original table, and try to insert each record into the new table, ignoring constraint violation errors (i.e. continue iterating when exceptions are raised).
Then drop the old table and rename the new one to the old name.
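A rough sketch of this idea, again with Python's sqlite3 module. Note one swap: instead of iterating row by row and catching exceptions, it uses SQLite's INSERT OR IGNORE with a SELECT, which skips rows that would violate the unique constraint. The column types, the "_dedup" suffix, and the function name rebuild_without_duplicates are assumptions for illustration.

```python
import sqlite3

def rebuild_without_duplicates(conn, table="YourTable"):
    """Copy rows into a new table with UNIQUE (hash, d), then swap it in."""
    new = f"{table}_dedup"  # hypothetical temporary name
    conn.execute(f"CREATE TABLE {new} (hash TEXT, d REAL, UNIQUE (hash, d))")
    # INSERT OR IGNORE silently skips rows that violate the unique constraint,
    # so no per-row exception handling is needed.
    conn.execute(
        f"INSERT OR IGNORE INTO {new} (hash, d) SELECT hash, d FROM {table}"
    )
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {new} RENAME TO {table}")
    conn.commit()

# Illustrative data only: column types and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 2.0), ("C", 1.0), ("C", 1.0)],
)

rebuild_without_duplicates(conn)
rows = conn.execute("SELECT hash, d FROM YourTable ORDER BY hash, d").fetchall()
print(rows)
```

A side effect of this variant is that the unique constraint stays on the rebuilt table, so later inserts of an existing (hash, d) pair will fail unless they also use OR IGNORE.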
Answered by MaDa
If adding a primary key isn't a possibility, one solution is to store one copy of each duplicated record (a DISTINCT set) in a temp table, delete all of the duplicated records from the existing table, and then add the records from the temp table back into the original table.
For instance (written for SQL Server 2008, but the method applies to any database):
    DECLARE @original AS TABLE([hash] varchar(20), [d] float)
    INSERT INTO @original VALUES('A', 1)
    INSERT INTO @original VALUES('A', 2)
    INSERT INTO @original VALUES('A', 1)
    INSERT INTO @original VALUES('B', 1)
    INSERT INTO @original VALUES('C', 1)
    INSERT INTO @original VALUES('C', 1)

    DECLARE @temp AS TABLE([hash] varchar(20), [d] float)
    INSERT INTO @temp
    SELECT [hash], [d] FROM @original
    GROUP BY [hash], [d]
    HAVING COUNT(*) > 1

    DELETE O
    FROM @original O
    JOIN @temp T ON T.[hash] = O.[hash] AND T.[d] = O.[d]

    INSERT INTO @original
    SELECT [hash], [d] FROM @temp

    SELECT * FROM @original
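The same three steps might look like this in SQLite, sketched with Python's sqlite3 module. SQLite has no table variables and does not accept the DELETE ... JOIN form, so this version assumes an ordinary table named original, a temp table named dup (both names made up here), and a correlated EXISTS subquery in place of the join. The sample rows mirror the ones in the SQL Server example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE original (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO original VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 1), ("B", 1), ("C", 1), ("C", 1)],
)

# 1. Store one copy of each duplicated (hash, d) pair in a temp table.
conn.execute("""
    CREATE TEMP TABLE dup AS
    SELECT hash, d FROM original
    GROUP BY hash, d
    HAVING COUNT(*) > 1
""")

# 2. Delete every row that belongs to a duplicated pair.
conn.execute("""
    DELETE FROM original
    WHERE EXISTS (
        SELECT 1 FROM dup
        WHERE dup.hash = original.hash AND dup.d = original.d
    )
""")

# 3. Put a single copy of each duplicated pair back, then clean up.
conn.execute("INSERT INTO original (hash, d) SELECT hash, d FROM dup")
conn.execute("DROP TABLE dup")
conn.commit()

rows = conn.execute("SELECT hash, d FROM original ORDER BY hash, d").fetchall()
print(rows)
```

Rows that were never duplicated (such as the 'B' row and the second 'A' row here) are left untouched throughout; only the duplicated pairs are removed and re-added once.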
I'm not sure if SQLite has a ROW_NUMBER() type function, but if it does, you could also try some of the approaches listed here: Delete duplicate records from a SQL table without a primary key
Answered by rsbarro
Post is based on https://stackoverflow.com/questions/8190541/deleting-duplicate-rows-from-sqlite-database