
Due to an error in our application, one of the tables in our MS SQL database has been flooded with duplicate rows that differ only in the PK identity column and the timestamp:

+--------+---------------------+--------------------+
| id     | time                |       value        |
+--------+---------------------+--------------------+
| #10003 | 2014-02-26 12:00:00 |   uniq_pattern_1   |
| #20005 | 2014-02-26 13:00:00 |   uniq_pattern_1   |
| #30018 | 2014-02-26 14:00:00 |   uniq_pattern_1   |
| #40022 | 2014-02-26 15:00:00 |   uniq_pattern_1   |
 ....................................................
| #10005 | 2014-02-26 12:00:00 |   uniq_pattern_2   |
| #20008 | 2014-02-26 13:00:00 |   uniq_pattern_2   |
| #30020 | 2014-02-26 14:00:00 |   uniq_pattern_2   |
| #40040 | 2014-02-26 15:00:00 |   uniq_pattern_2   |
+--------+---------------------+--------------------+

There are 100+ million rows with only about 50k unique patterns spread across the entire table (every existing pattern was re-inserted every hour, with a few new patterns added to the mix). The table has only a clustered PK index. I need to delete all duplicates, leaving only the earliest occurrence of each pattern.

I plan to do the following:

Grab the earliest row for each unique pattern into a temp table:

SELECT [value], MIN([time]) AS [time]
INTO #MyTempTable
FROM [MyBigTable]
GROUP BY [value];

Truncate [MyBigTable], then repopulate it from #MyTempTable.
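Put together, the plan would look something like the sketch below. Column names and types are assumed from the sample data above ([id] as the identity PK, [time], [value]); note that the original identity values are not preserved, since new ids are generated on reinsert.

```sql
BEGIN TRANSACTION;

-- Snapshot the earliest row per pattern, holding an exclusive table lock
-- until commit so no new rows arrive between the snapshot and the truncate
SELECT [value], MIN([time]) AS [time]
INTO #MyTempTable
FROM [MyBigTable] WITH (TABLOCKX, HOLDLOCK)
GROUP BY [value];

-- TRUNCATE is transactional in SQL Server and rolls back if anything fails
TRUNCATE TABLE [MyBigTable];

-- Reinsert the de-duplicated rows; [id] is regenerated by the identity column
INSERT INTO [MyBigTable] ([time], [value])
SELECT [time], [value]
FROM #MyTempTable;

COMMIT TRANSACTION;
```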

Is this a good way to do this?

Misha Brukman
Ivan Koshelev
  • If nothing depends on your primary key, then this is a viable solution. You could also just run multiple batch deletes from the table; this would remove the need for a temp table. – Mr.Brownstone Dec 28 '18 at 21:42
  • I wouldn't use a temp table. What happens if you accidentally click X on your SSMS tab, or your computer (or SSMS) crashes after the truncation but before you reinsert the rows? That temp table would cease to exist. Instead, I would use a persisted / actual table; call it MyBigTable2. Then you can use it as you intended, or simply drop MyBigTable (or rename it to MyBigTable_Old for a while) and then rename your "new" table holding the unique records, MyBigTable2, to MyBigTable. – S3S Dec 28 '18 at 21:44
  • Create a physical table with a different schema, insert the required data, and then do a schema switch. Test it in a non-prod environment and then implement it. – Kin Shah Dec 28 '18 at 21:51
  • Also make sure the table is locked for the duration, to prevent rows being lost if they are inserted after the snapshot of rows to keep is taken. – Martin Smith Dec 28 '18 at 22:02
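For reference, the batched-delete alternative suggested in the first comment could be sketched like this (column names assumed from the sample table; the batch size is an arbitrary example):

```sql
DECLARE @batch INT = 50000;
DECLARE @rows  INT = 1;

WHILE @rows > 0
BEGIN
    -- Number rows within each [value] group, earliest first,
    -- then delete a batch of everything that isn't row 1
    WITH dupes AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY [value]
                                  ORDER BY [time], [id]) AS rn
        FROM [MyBigTable]
    )
    DELETE TOP (@batch) FROM dupes
    WHERE rn > 1;

    SET @rows = @@ROWCOUNT;
END;
```

This avoids the temp table and the truncate, at the cost of re-scanning the table on each iteration, which may be slow with only a clustered PK index and this many rows.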
