
Due to an error in our application, one of the tables in our MS SQL database has been flooded with duplicate rows that differ only in the PK identity column and the timestamp:

+--------+---------------------+--------------------+
| id     | time                |       value        |
+--------+---------------------+--------------------+
| #10003 | 2014-02-26 12:00:00 |   uniq_pattern_1   |
| #20005 | 2014-02-26 13:00:00 |   uniq_pattern_1   |
| #30018 | 2014-02-26 14:00:00 |   uniq_pattern_1   |
| #40022 | 2014-02-26 15:00:00 |   uniq_pattern_1   |
 ....................................................
| #10005 | 2014-02-26 12:00:00 |   uniq_pattern_2   |
| #20008 | 2014-02-26 13:00:00 |   uniq_pattern_2   |
| #30020 | 2014-02-26 14:00:00 |   uniq_pattern_2   |
| #40040 | 2014-02-26 15:00:00 |   uniq_pattern_2   |
+--------+---------------------+--------------------+

There are 100+ million rows with only about 50k unique patterns spread across the entire table (every existing pattern was re-inserted every hour, with a few new patterns added to the mix). The table has only a clustered PK index. I need to delete all duplicates, leaving only the earliest occurrence of each pattern.

I plan to do the following:

Grab the earliest row for each unique pattern into a temp table:

SELECT [value], MIN([time]) AS [time]
INTO #MyTempTable
FROM [MyBigTable]
GROUP BY [value];

Truncate [MyBigTable], then repopulate it from #MyTempTable.
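Put together, the plan would look something like the sketch below. Column names and types are assumed from the sample data above ([id] as the identity PK, [time], [value]); note that the original identity values are not preserved, since new ids are generated on reinsert.

```sql
BEGIN TRANSACTION;

-- Snapshot the earliest row per pattern, holding an exclusive table lock
-- until commit so no new rows arrive between the snapshot and the truncate
SELECT [value], MIN([time]) AS [time]
INTO #MyTempTable
FROM [MyBigTable] WITH (TABLOCKX, HOLDLOCK)
GROUP BY [value];

-- TRUNCATE is transactional in SQL Server and rolls back if anything fails
TRUNCATE TABLE [MyBigTable];

-- Reinsert the de-duplicated rows; [id] is regenerated by the identity column
INSERT INTO [MyBigTable] ([time], [value])
SELECT [time], [value]
FROM #MyTempTable;

COMMIT TRANSACTION;
```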

Is this a good way to do this?

Misha Brukman
Ivan Koshelev
  • If nothing depends on your primary key, then this is a viable solution. You could also just run multiple batch deletes from the table; this would remove the need for a temp table. – Mr.Brownstone Dec 28 '18 at 21:42
  • I wouldn't use a temp table. What happens if you accidentally click X on your SSMS tab, or your computer (or SSMS) crashes after the truncation but before you reinsert the rows? That temp table would cease to exist. Instead, I would use a persisted / actual table; call it MyBigTable2. Then you can use it as you intended, or simply drop MyBigTable (or rename it to MyBigTable_Old for a while) and then rename your "new" table holding the unique records, MyBigTable2, to MyBigTable. – S3S Dec 28 '18 at 21:44
  • Create a physical table with a different schema, insert the required data, and then do a schema switch. Test it in a non-prod environment and then implement it. – Kin Shah Dec 28 '18 at 21:51
  • Also make sure the table is locked for the duration, to prevent rows being lost if they are inserted after the snapshot of rows to keep is taken. – Martin Smith Dec 28 '18 at 22:02
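For reference, the batched-delete alternative suggested in the first comment could be sketched like this (column names assumed from the sample table; the batch size is an arbitrary example):

```sql
DECLARE @batch INT = 50000;
DECLARE @rows  INT = 1;

WHILE @rows > 0
BEGIN
    -- Number rows within each [value] group, earliest first,
    -- then delete a batch of everything that isn't row 1
    WITH dupes AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY [value]
                                  ORDER BY [time], [id]) AS rn
        FROM [MyBigTable]
    )
    DELETE TOP (@batch) FROM dupes
    WHERE rn > 1;

    SET @rows = @@ROWCOUNT;
END;
```

This avoids the temp table and the truncate, at the cost of re-scanning the table on each iteration, which may be slow with only a clustered PK index and this many rows.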
