Duplicate rows - how to remove one?

Question

I have a pretty large table (~114 million rows) containing OS MasterMap data. This is freshly loaded data in a new table. When trying to set the primary key, I get this error:

ERROR:  could not create unique index "tbl_os_mmap_topoarea_pkey"
DETAIL:  Key (toid)=(1000000004081308) is duplicated.

Somehow, I have ended up with an exactly duplicated row. Every field is the same in these two rows. I want to delete one row, but keep the other. As there is no way to distinguish between the two, how can this be done?

I would like to do this as quickly and simply as possible. Creating temporary tables etc. is not really an option as it would take too long on a dataset of this size. Creating a new unique ID column would be quicker I guess, but also probably take some time.

After a bit of research, I have learned that all records in postgres have a hidden unique id, the ctid. Can I use this to delete one of the duplicate rows?

@Vérace I already did that. But then how can I delete just one of them? — Matt, May 12 '16 at 10:21

score 11 · Accepted Answer · edited Jun 15 '20 at 09:05

This will work I think:

with d as 
  ( select ctid, row_number() over (partition by t.*) as rn 
    from tablename as t 
  ) 
delete from tablename as t 
using d 
where d.rn > 1 
  and d.ctid = t.ctid ;
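
Before running a delete like this on a table of this size, it may be worth previewing which rows it would remove. A minimal sketch, reusing the placeholder tablename from above and the toid column from the question; it uses the same CTE but selects instead of deleting:

```sql
-- Preview only: lists the physical location (ctid) of every surplus copy.
-- "tablename" is the placeholder used above; toid is the key column from the question.
with d as
  ( select ctid, toid, row_number() over (partition by t.*) as rn
    from tablename as t
  )
select d.ctid, d.toid, d.rn
from d
where d.rn > 1 ;
```

If the result looks right, the delete above should remove exactly those rows, provided nothing updates the table in between.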

And another variation. Not sure which will be more efficient:

delete from tablename as t 
where exists 
      ( select * 
        from tablename as d 
        where d.ctid > t.ctid 
          and d.* is not distinct from t.*
      ) ;
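
When only one key is known to be duplicated, as in the error message from the question, a narrower statement might avoid comparing every row against every other. A sketch, assuming the table is named tbl_os_mmap_topoarea (a guess based on the index name in the error) and that toid accepts the quoted literal, which should hold for both numeric and text columns:

```sql
-- Assumed table name, inferred from the index name in the error message.
-- Deletes exactly one physical copy of the duplicated key and keeps the rest.
delete from tbl_os_mmap_topoarea
where ctid = ( select ctid
               from tbl_os_mmap_topoarea
               where toid = '1000000004081308'
               limit 1
             ) ;
```

This removes a single copy per run, so a key that appears three times would need it twice. The error only reports the first duplicate found, so creating the primary key afterwards may still surface other duplicated keys.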

But note what the docs say about ctid:

ctid

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.

So, if the table was created WITH OIDS, use that instead.
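
A rough sketch of what the OID-based variant could look like. It only applies to tables created WITH OIDS, which PostgreSQL 12 and later no longer support. It matches on toid alone, which assumes rows sharing a toid are identical, as described in the question:

```sql
-- Only valid if the table was created WITH OIDS (not available from PostgreSQL 12 on).
-- For each toid, keeps the row with the lowest oid and deletes the others.
-- Assumption: rows with the same toid are exact duplicates.
delete from tablename as t
using tablename as d
where d.toid = t.toid
  and d.oid < t.oid ;
```

Joining on toid with plain equality, rather than comparing whole rows, should let the planner pick a hash or merge join.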

answered May 12 '16 at 10:44

ypercubeᵀᴹ


Thanks for the answer @ypercube. Your mention of OIDs led me to double check. My table did have OIDs, so I was able to use those to remove the duplicate row. – Matt May 12 '16 at 14:30
Nice. I'm not sure, but ctid might be OK as well in a single statement/transaction. The problems may appear in long-running transactions or other operations that take longer. See also the duplicate question and its answers for other options. – ypercubeᵀᴹ May 12 '16 at 16:31
As a data point, I had a table with ~85000 rows containing ~4000 duplicates, and the first version of this query was far more performant for me - it finished almost instantly, while the second version was timing out after 2 hours. – Jeremy Penner Apr 15 '19 at 15:13
