Including deleted_at in unique index for upsert

Question

I have a table with a deleted_at column, see comment.

The deleted_at column behaves like a one-way boolean event flag, i.e. once a row is deleted it can't be restored.

There is also an asynchronous process that permanently deletes rows older than a certain timestamp. This process works in batches, for example: "pt-archiver nibbles records from a MySQL table"

For efficiently looking up rows to delete (and paging in general), the table has an id column as its primary key. The id is generated in application code; it is a GUID that is sortable by creation date.

primary key (`id`)

Deleting rows is a three-step process:

-- De-activate
update x set deleted_at = :timestamp where id = :id;
-- Select IDs to delete
select id from x where deleted_at is not null;
-- Hard delete
delete from x where id in (...);
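Steps two and three could be run as one batch of a paged purge. This is only a sketch under assumptions: it uses the table x from the question, treats deleted_at as an integer Unix timestamp, and the variables @cutoff and @last_id as well as the batch size of 1000 are hypothetical.

```sql
-- Assumed: purge rows de-activated before this (hypothetical) Unix timestamp.
SET @cutoff  = 1683000000;
-- Assumed: largest id seen in the previous batch; start with the empty string.
SET @last_id = '';

-- Step 2: page through de-activated rows in primary-key order.
SELECT id
  FROM x
 WHERE deleted_at IS NOT NULL
   AND deleted_at < @cutoff
   AND id > @last_id
 ORDER BY id
 LIMIT 1000;

-- Step 3: hard delete the ids returned above (list elided, as in the question),
-- then set @last_id to the largest id of the batch and repeat until no rows come back.
-- delete from x where id in (...);
```

Paging on id keeps every batch a range scan of the primary key; the deleted_at filters are applied to the rows read in that range.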

Also, the table is updated in batches with an upsert query. The "on duplicate key update" clause makes use of a unique index consisting of two columns (code, deleted_at). The code must be unique over all active rows, and deleted_at is null for active rows. The upsert query looks like this:

insert into x (id, code, foo, deleted_at)
values (...)
on duplicate key update foo = values(foo);

Is it a bad idea to make deleted_at part of a unique index, and if so why?

unique key `uniq_code_deleted_at`(`code`, `deleted_at`)

Well with such a unique key, someone could insert multiple code rows with different deleted_at, unless you have another unique index over code by itself. What do you want to happen? — Charlieface, May 10 '23 at 11:13
There must only be one "active" row (deleted_at is null) per code — mozey, May 11 '23 at 02:03
Then you need a filtered unique index, which MySQL doesn't support yet. You would have to normalize it out into another table x_Active which is foreign-keyed to x — Charlieface, May 11 '23 at 09:31
The on duplicate key update clause of the upsert query ensures there is only one active row per code — mozey, May 11 '23 at 09:49
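One possible way to approximate the filtered unique index mentioned in the comments is a generated column that holds the code only while the row is active. The following is a minimal sketch, assuming MySQL 5.7 or later; the table name x_active_demo, the column active_code, the index name uniq_active_code, the column types, and the sample values are all hypothetical.

```sql
CREATE TABLE x_active_demo (
  id          VARCHAR(27)  NOT NULL,
  code        VARCHAR(64)  NOT NULL,
  foo         VARCHAR(255) NULL,
  deleted_at  BIGINT       NULL,
  -- Equals code while the row is active, NULL once it is de-activated.
  active_code VARCHAR(64)
      GENERATED ALWAYS AS (IF(deleted_at IS NULL, code, NULL)) STORED,
  PRIMARY KEY (id),
  UNIQUE KEY uniq_active_code (active_code)
);

INSERT INTO x_active_demo (id, code, foo, deleted_at)
VALUES ('id-1', 'A', 'first', NULL);

-- Same active code: collides on uniq_active_code, so the existing row's foo
-- is updated and no second row is inserted.
INSERT INTO x_active_demo (id, code, foo, deleted_at)
VALUES ('id-2', 'A', 'second', NULL)
ON DUPLICATE KEY UPDATE foo = VALUES(foo);

-- De-activate: active_code becomes NULL, which frees the code for reuse.
UPDATE x_active_demo SET deleted_at = 1683800000 WHERE id = 'id-1';

-- No active row with code 'A' remains, so this one is inserted as a new row.
INSERT INTO x_active_demo (id, code, foo, deleted_at)
VALUES ('id-3', 'A', 'third', NULL)
ON DUPLICATE KEY UPDATE foo = VALUES(foo);
```

De-activated rows all carry a NULL active_code, and a unique index accepts any number of NULLs, so they never collide with each other or with the active row.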

Rick James · Answer 1 · 2023-05-11T18:28:04.023

By "GUID", are you referring to "KSUID"?
A UNIQUE index confers a uniqueness constraint on the set of columns in it. Tacking on extra columns invalidates the constraint on simply code.
When you talk about "deleting", are you referring to the "soft delete" using deleted_at? Or the "hard delete" using DELETE FROM t WHERE ...?
Do you reference (say, JOIN) this table from other tables ON the id? If not, why have the id column at all?
For (code, deleted_at) where code is unique, why add the extra column? Upsert will be happy to do nothing if you change deleted_at to the value it is currently set to. (Except it might "burn" an auto_increment id.)
If you plan to "delete" most or all of a range of GUIDs at a time, don't bother pulling the ids out, that is an unnecessary extra step. But do bound the area to search.
More suggestions on big deletes: http://mysql.rjweb.org/doc.php/deletebig
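The point about the uniqueness constraint can be checked directly. MySQL allows multiple NULLs in a UNIQUE index, so with deleted_at NULL for active rows, a key on (code, deleted_at) should not flag two active rows with the same code as duplicates. A minimal sketch, assuming a hypothetical table x_demo whose column types follow the asker's description (string id and code, integer deleted_at):

```sql
-- Hypothetical table; names and types are assumptions for illustration.
CREATE TABLE x_demo (
  id         VARCHAR(27)  NOT NULL,
  code       VARCHAR(64)  NOT NULL,
  foo        VARCHAR(255) NULL,
  deleted_at BIGINT       NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uniq_code_deleted_at (code, deleted_at)
);

INSERT INTO x_demo (id, code, foo, deleted_at)
VALUES ('id-1', 'A', 'first', NULL);

-- Same code, also active. ('A', NULL) is not treated as a duplicate of
-- ('A', NULL), so the ON DUPLICATE KEY branch is not taken and a second
-- active row is inserted.
INSERT INTO x_demo (id, code, foo, deleted_at)
VALUES ('id-2', 'A', 'second', NULL)
ON DUPLICATE KEY UPDATE foo = VALUES(foo);

-- Should list both id-1 and id-2.
SELECT id, code, foo
  FROM x_demo
 WHERE code = 'A' AND deleted_at IS NULL;
```

If that holds on your version, the index only prevents two rows from sharing the same code and the same non-NULL deleted_at.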

Please provide SHOW CREATE TABLE and the actual queries involved.

Possible DELETE

Re "To preserve most recently de-activated rows it makes sense to select the IDs first": a single DELETE can remove the oldest de-activated rows first:

DELETE FROM t
       WHERE deleted_at IS NOT NULL
       ORDER BY deleted_at ASC
       LIMIT 100;

But that does not control how many of the "most recently deactivated rows" will be kept. So, let's flip that around and use an OFFSET. MySQL's DELETE accepts LIMIT but not OFFSET, so the offset goes into a derived table that is joined back to t. Suppose you want to keep the most recent 50 and delete no more than 100 at a time:

DELETE t
       FROM t
       JOIN ( SELECT id FROM t
                  WHERE deleted_at IS NOT NULL
                  ORDER BY deleted_at DESC
                  LIMIT 100  OFFSET 50 ) AS old USING (id);

INDEX(deleted_at) would help with either DELETE, but adds overhead whenever deleted_at is changed.

The first DELETE removes the indicated rows without explicitly mentioning id; the second needs id only to join back to the derived table.

If none need to be deleted, they will do nothing (other than taking some time).

edited May 11 '23 at 18:28

answered May 10 '23 at 22:08

Rick James


Yes, KSUID "is a kind of globally unique identifier similar to a RFC 4122 UUID" that is also key sortable. The id column can be used to list rows in order of creation date – mozey May 11 '23 at 02:06
Deleting is a three step process, see edits in the question – mozey May 11 '23 at 02:08
If deleted_at is not part of the unique index, then upsert would update rows that have been de-activated. Burning auto_increment is not an issue here because the id is generated outside the DB – mozey May 11 '23 at 02:13
Doing delete from x where deleted_at is not null limit 100 is an option. To preserve most recently de-activated rows it makes sense to select the IDs first? – mozey May 11 '23 at 02:20
Can't provide show create table, this is a hypothetical question. Both the id and code columns are strings, and deleted_at is an integer. The primary key and unique index are as I've shown in the question – mozey May 11 '23 at 02:23
Edited the upsert example for clarification, the code and deleted_at columns must not be updated by upsert. – mozey May 11 '23 at 04:51
@mozey - See my addition. – Rick James May 11 '23 at 18:28
