
I currently have a PHP script that inserts thousands to millions of rows into an InnoDB table. To prevent duplicates, the table has a UNIQUE index set up on the combination of four columns, and I use INSERT IGNORE to allow the insert to continue regardless of duplicates.
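
For reference, the setup looks roughly like this (the table and column names here are just placeholders, not my real schema):

    CREATE TABLE real_table (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
        col1 VARCHAR(50) NOT NULL,
        col2 VARCHAR(50) NOT NULL,
        col3 VARCHAR(50) NOT NULL,
        col4 VARCHAR(50) NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY uniq_combo (col1, col2, col3, col4)
    ) ENGINE=InnoDB;

    -- duplicates of (col1, col2, col3, col4) are silently skipped
    INSERT IGNORE INTO real_table (col1, col2, col3, col4)
    VALUES ('a', 'b', 'c', 'd'),
           ('e', 'f', 'g', 'h');   -- ... 2000 tuples per statement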

For performance reasons, rather than inserting one row at a time, I batch them up into 2000 rows per query. I want to know which individual rows are ignored due to a key violation. The only way I can think of doing it is to insert one row at a time and check the value of mysqli_affected_rows after each insert, but that feels inefficient and I would lose the advantage of batching my inserts. Alternatively, I could remove the UNIQUE index and retrospectively check for duplicates with some SQL at the end.
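
For illustration, the retrospective check I have in mind would be something along these lines (placeholder names again):

    -- After dropping the UNIQUE index and loading everything,
    -- list the four-column combinations that occur more than once
    SELECT col1, col2, col3, col4, COUNT(*) AS cnt
    FROM real_table
    GROUP BY col1, col2, col3, col4
    HAVING cnt > 1;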

Any other suggestions?

Many thanks.

Ben Price

1 Answer


This can probably be solved with a more complex set of steps. You still get the benefit of the bulk INSERT.

Build a temp table (probably a permanent table rather than a TEMPORARY one, but TRUNCATE it between uses).

INSERT the 2K rows into it.

Analyze the rows via a JOIN between the temp table and the real table.

INSERT INTO real ... SELECT ... JOIN ... to copy over only the 'new' rows. (A rough sketch of these steps follows.)
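
Here is one way those steps might look in SQL, using the placeholder table and column names from the question (adapt them to your real schema and your four unique columns):

    -- Step 1: one-time setup; TRUNCATE it between 2000-row batches
    CREATE TABLE staging (
        col1 VARCHAR(50) NOT NULL,
        col2 VARCHAR(50) NOT NULL,
        col3 VARCHAR(50) NOT NULL,
        col4 VARCHAR(50) NOT NULL
    ) ENGINE=InnoDB;

    -- Step 2: bulk-load the next batch of 2000 rows (no IGNORE needed here)
    INSERT INTO staging (col1, col2, col3, col4)
    VALUES ('a', 'b', 'c', 'd'),
           ('e', 'f', 'g', 'h');   -- ... up to 2000 tuples

    -- Step 3: these are the rows that INSERT IGNORE would have silently dropped
    SELECT s.*
    FROM staging AS s
    JOIN real_table AS r USING (col1, col2, col3, col4);

    -- Step 4: copy only the genuinely new rows into the real table;
    -- IGNORE/DISTINCT guard against duplicates within the batch itself
    INSERT IGNORE INTO real_table (col1, col2, col3, col4)
    SELECT DISTINCT s.col1, s.col2, s.col3, s.col4
    FROM staging AS s
    LEFT JOIN real_table AS r USING (col1, col2, col3, col4)
    WHERE r.col1 IS NULL;

    -- Reset for the next batch
    TRUNCATE TABLE staging;

The step-3 SELECT gives you the per-row list the question asks for; if you don't need it, skip straight to step 4.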

More details in http://mysql.rjweb.org/doc.php/staging_table , especially the section on 'Normalization', which uses two queries in a slightly different way -- one to insert 'new' rows, another to pull back the id for every row.

Your app may also benefit from the "flip-flop" technique described there.

Rick James
  • I haven't yet had a chance to experiment with this but wanted to thank you in the meantime for your suggestion. Using some kind of temp table certainly makes sense :) – Ben Price Jan 16 '18 at 16:03