During the migration of a 300-million-row table I encountered strange behaviour on our AWS Aurora database:
Around 25 minutes into the migration, the total IOPS of the database dropped to almost zero and the session appeared to idle without any waits; the statement never returned and seemed to be stuck in the operation.
How can this behaviour be explained? Is it related to Aurora and its read replicas? I would expect the migration to run on Aurora just as it did on another PostgreSQL RDS instance.
Quick facts:
- AWS Aurora PostgreSQL database, 3 instances of the same instance class/size, 2 of them readers, multi-AZ.
- Huge table: the migration affects ~100 million of ~330 million total rows.
- The migration was split into 3 chunks of 100 million rows each (by id range).
- Dead tuples were cleaned up before the migration with a manual vacuum (roughly as sketched below this list).
- The migration ran successfully in the dev environment (a copy of the database on a PostgreSQL RDS instance) in about 70 minutes.
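A minimal sketch of that manual cleanup (the VERBOSE/ANALYZE options and the follow-up check are illustrative, not necessarily the exact statements used):

-- Manual cleanup of dead tuples before the migration:
VACUUM (VERBOSE, ANALYZE) debtclaim_event;

-- Sanity check that the dead tuples are gone:
SELECT relname, n_live_tup, n_dead_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'debtclaim_event';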
Chunk 1 query in question:
-- Expected affected rows: 15154380
UPDATE debtclaim_event
SET acceptance_mode = 'NORMAL',
modifiedon = now(),
modifiedby = 'system#FC-7868',
optlock = optlock + 1
WHERE type = 'ACCEPTED'
AND id < 100000000;
Related EXPLAIN ANALYSE:
EXPLAIN ANALYSE SELECT * FROM debtclaim_event WHERE type = 'ACCEPTED' AND acceptance_mode IS NULL;
resulted in:
Seq Scan on debtclaim_event (cost=0.00..10236490.40 rows=33147254 width=550) (actual time=7.737..141019.913 rows=33358994 loops=1)
Filter: ((acceptance_mode IS NULL) AND (type = 'ACCEPTED'::text))
Rows Removed by Filter: 265944598
Planning Time: 0.096 ms
Execution Time: 142877.726 ms
Further steps
I tried reducing the chunk size to 10 million rows. With that, the first chunk passed (it contained no rows to update) and the second chunk passed as well, altering ~1.4 million rows in about 3 minutes.
The third chunk showed similar behaviour: total IOPS dropped after about 6 minutes, and it then idled for 15 more minutes without returning.
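For illustration, a single 10-million-row sub-chunk looked roughly like this (the id boundaries here are examples, not the exact values used):

-- Example sub-chunk; id range is illustrative only
UPDATE debtclaim_event
SET acceptance_mode = 'NORMAL',
    modifiedon = now(),
    modifiedby = 'system#FC-7868',
    optlock = optlock + 1
WHERE type = 'ACCEPTED'
  AND id >= 20000000
  AND id < 30000000;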
Beyond that I have not tried to optimize the migration further. A possible way forward could be to create a new table instead of updating the original one, as described in another post (see the sketch below).
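A rough sketch of that alternative (names assumed from the query above; indexes, constraints, foreign keys, triggers and grants would need to be recreated, and writers paused before the swap):

-- Copy the rows into a fresh, index-free table and apply the change there:
CREATE TABLE debtclaim_event_new (LIKE debtclaim_event INCLUDING DEFAULTS);

INSERT INTO debtclaim_event_new
SELECT * FROM debtclaim_event;

UPDATE debtclaim_event_new
SET acceptance_mode = 'NORMAL',
    modifiedon = now(),
    modifiedby = 'system#FC-7868',
    optlock = optlock + 1
WHERE type = 'ACCEPTED';

-- After recreating indexes/constraints and pausing writers, swap the tables:
ALTER TABLE debtclaim_event RENAME TO debtclaim_event_old;
ALTER TABLE debtclaim_event_new RENAME TO debtclaim_event;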
More insights from AWS RDS
During the operation, other sessions tried to access and alter the table in question, but they mostly waited:
Here is another graph depicting the operation in further detail:


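For completeness, this is roughly how the waiting and blocked sessions can be inspected while the statement hangs (a generic sketch, not output from the run above; pg_blocking_pids() requires PostgreSQL 9.6+):

-- Which sessions are waiting, and on what:
SELECT pid, state, wait_event_type, wait_event, left(query, 80) AS query
FROM pg_stat_activity
WHERE datname = current_database()
  AND state <> 'idle';

-- Which sessions are blocked, and by which pids:
SELECT pid, pg_blocking_pids(pid) AS blocked_by, left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;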