3

I am looking to remove older data from a table which consists of 889 million rows of data.

I have a script, but I am trying to make it more robust, as I will be removing approximately 418 million rows of data.

I am running Postgres 9.6. The table, indexes and FK constraints:

          Column          |           Type           | Collation | Nullable |                Default                
--------------------------+--------------------------+-----------+----------+---------------------------------------
 game_id                  | integer                  |           | not null | nextval('game_game_id_seq'::regclass)
 game_id                  | integer                  |           | not null | 
 session_id               | integer                  |           |          | 
 game_created_on          | timestamp with time zone |           | not null | 
 currency_code            | character(3)             |           | not null | 
 game_cash_staked         | numeric(12,2)            |           |          | 
 game_cash_won            | numeric(12,2)            |           |          | 
 game_bonus_staked        | numeric(12,2)            |           |          | 
 game_bonus_won           | numeric(12,2)            |           |          | 
 game_created_by_id       | integer                  |           | not null | 
 game_remote_ref          | character varying(50)    |           |          | 
 game_description         | text                     |           |          | 
 game_batch_id            | integer                  |           |          | 
 game_rejection_code_id   | integer                  |           |          | 
 game_rejection_message   | character varying(255)   |           |          | 
 game_transfer_remote_ref | character varying(128)   |           |          |

Indexes: "game_pkey" PRIMARY KEY, btree (game_id) "idx_game_created_on_rejection_code" btree (game_created_on) WHERE game_rejection_code_id IS NULL "idx_game_game_created_on" btree (game_created_on) "idx_game_session_id" btree (session_id) "game_idx_01" btree (game_remote_ref) "game_idx_game_id" btree (game_id) Foreign-key constraints: "ref_game_to_currency" FOREIGN KEY (currency_code) REFERENCES currency(currency_code) "ref_game_to_game" FOREIGN KEY (game_id) REFERENCES game(game_id) "ref_game_to_game_rejection_code" FOREIGN KEY (game_rejection_code_id) REFERENCES game_rejection_code(game_re jection_code_id)

Script I have looked to use:

BEGIN;

CREATE TABLE gamearchived AS
SELECT t.*
FROM   game t
WHERE  t.game_created_on < NOW() - interval '1 year';  -- this grabs stuff older than 1 year

DELETE FROM game t
WHERE  t.game_id IN (SELECT gamearchived.game_id FROM gamearchived);

SELECT count(*) FROM gamearchived;

COMMIT;

I am wondering whether this is the safest way to remove the old data from the main table, or whether it would be better to do it in batches. Also, the table I will be removing data from has indexes and foreign key constraints; would it be better to drop the indexes before deleting and add them back once done? The amount of data removed is approximately 418 million rows.

I need to keep the old data, so it needs to remain accessible. Any advice is much appreciated.

rdbmsNoob
  • Show your table and index definitions instead of just talking about it. And always your version of Postgres. SELECT version(); Also, please be consistent. approximately 450 million rows <> approximately 400 million rows. Is there concurrent (write) access to the DB? – Erwin Brandstetter Oct 27 '21 at 13:37
  • @ErwinBrandstetter apologies, have updated. – rdbmsNoob Oct 27 '21 at 14:22

1 Answer

3

The first step would be to upgrade to a current version. Postgres 9.6 reaches EOL on November 11, 2021 - in three weeks. Postgres 13 or 14 are much faster with big data. Dropping and recreating all indexes has an additional benefit in this case: you tap into index deduplication, new in Postgres 13, which shrinks indexes with duplicative data to a fraction of their size.

Assuming no concurrent access.

would it be better to drop the indexes first prior to deleting and then adding them back in once done.

Because you delete half the table, typically yes. The added benefit is that recreated indexes are in pristine condition without bloat.
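
If you want to script the "remember the DDL" part, here is a sketch; only the index names from the question are used, so verify against pg_indexes on your side. game_pkey is left in place in this sketch because the self-referencing FK ref_game_to_game depends on it; dropping it too would mean dropping that constraint first.

-- Save the CREATE INDEX statements so they can be replayed after the delete:
SELECT indexdef || ';'
FROM   pg_indexes
WHERE  schemaname = 'public'
AND    tablename  = 'game';

-- Drop the secondary indexes (can run inside the same transaction as below):
DROP INDEX public.idx_game_created_on_rejection_code;
DROP INDEX public.idx_game_game_created_on;
DROP INDEX public.idx_game_session_id;
DROP INDEX public.game_idx_01;
DROP INDEX public.game_idx_game_id;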

This will be much faster:

BEGIN;
CREATE TABLE public.gamearchived (LIKE public.game);

-- DROP all indexes on table game here (be sure to remember the DDL!)

WITH del AS (
   DELETE FROM game
   WHERE  game_created_on < NOW() - interval '1 year'  -- older than 1 year
   RETURNING *
   )
, ins AS (
   INSERT INTO public.gamearchived
   SELECT * FROM del
   ORDER BY game_created_on  -- optional, only if it helps future queries
   )
SELECT count(*) FROM del;  -- get your count

-- run checks if you are not sure; last chance.

COMMIT;

If anything goes wrong, the transaction is rolled back. So this is safe.

Creating the new table within the same transaction saves a lot of overhead: no additional WAL needs to be written.
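
Worth checking first: that WAL-skipping optimization for a table created in the same transaction only applies when wal_level is set to minimal; with replica or logical, the inserts into the new table are still WAL-logged.

SHOW wal_level;  -- WAL for the new table can only be skipped if this returns 'minimal'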

The main difference: this only needs a single sequential scan on the big table. Your original does a lot of additional (pointless) work.
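
If you want to see that plan before committing to anything, EXPLAIN without ANALYZE only plans the statement and does not execute the data-modifying CTEs, so this is a safe dry check:

EXPLAIN
WITH del AS (
   DELETE FROM game
   WHERE  game_created_on < NOW() - interval '1 year'
   RETURNING *
   )
, ins AS (
   INSERT INTO public.gamearchived
   SELECT * FROM del
   )
SELECT count(*) FROM del;
-- expect a single Seq Scan on game feeding the Delete node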

Also, we don't need any indexes for the big DELETE. It's cheaper to drop them before and recreate them after, than to incrementally keep them up to date.

I would at least run this afterwards:

VACUUM ANALYZE game;
VACUUM ANALYZE gamearchived;

To free up space (and while there is no concurrent access), even:

VACUUM FULL ANALYZE game;

(No point in running VACUUM FULL on gamearchived.)
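
A quick size check before and after (plain catalog functions, nothing specific to this setup) shows how much space the DELETE left behind and how much VACUUM FULL gives back:

SELECT pg_size_pretty(pg_total_relation_size('public.game'))         AS game_size,
       pg_size_pretty(pg_total_relation_size('public.gamearchived')) AS gamearchived_size;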

Ideally, recreate all indexes on table game now.
Also, create any constraints or indexes (PK?) you want on table gamearchived. Doesn't have to be in the same transaction.
The important bit is to do it after the big insert, which is much cheaper and typically produces better results (indexes balanced without bloat).
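
A sketch of that step, reusing the definitions from the question; whether gamearchived needs anything beyond a PK is an assumption, so only add what your archive queries actually use:

-- Recreate the secondary indexes on game, e.g. from the DDL saved earlier:
CREATE INDEX idx_game_created_on_rejection_code ON public.game (game_created_on)
   WHERE game_rejection_code_id IS NULL;
CREATE INDEX idx_game_game_created_on ON public.game (game_created_on);
CREATE INDEX idx_game_session_id ON public.game (session_id);
CREATE INDEX game_idx_01 ON public.game (game_remote_ref);
CREATE INDEX game_idx_game_id ON public.game (game_id);

-- Give the archive table a primary key (assumes game_id stays unique there):
ALTER TABLE public.gamearchived ADD PRIMARY KEY (game_id);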

Outgoing FK constraints in table game don't matter; they don't impose any additional work for DELETE. (Incoming FK constraints would matter, as Postgres would have to check for possible referencing rows in linked tables.)
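
To see which constraints reference game (the self-referencing ref_game_to_game from the question shows up here, too), a catalog query along these lines works:

-- List incoming FK constraints, i.e. constraints whose referenced table is game:
SELECT conname,
       conrelid::regclass AS referencing_table
FROM   pg_constraint
WHERE  contype   = 'f'
AND    confrelid = 'public.game'::regclass;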

Related:

Basics in this related chapter of the manual: Populating a Database

Erwin Brandstetter
  • thanks for this, have since updated the original post regarding table structure along with Indexes/FK Constraints. In the query you wrote this part --> ORDER BY game_created_on; if my original table has a PK Key on the game_id would it be better to order via this? Also the FK constraints would it be better to drop along with the indexes and then apply afterwards? Thanks again! – rdbmsNoob Oct 27 '21 at 14:59
  • @rdbmsNoob: I commented on the FK constraint in the meantime. The ideal physical order of rows depends on the predominant query load. Ordering by the PK may or may not be ideal. – Erwin Brandstetter Oct 27 '21 at 15:09
  • I get a syntax error --> syntax error at or near ";" doesn't tell me which position it has an issue with. Why would this be? – rdbmsNoob Oct 28 '21 at 14:16