
I need to populate a foreign key column based on the contents of another table, and both methods I've tried are prohibitively slow.

I have the below tables (some columns omitted):

              Table "public.interface_events"
           Column           |  Type   |     Modifiers
----------------------------+---------+--------------------
 device_id                  | integer | not null
 interface_id               | integer |

Foreign-key constraints:
    "interface_events_device_id_fkey" FOREIGN KEY (device_id) REFERENCES device(id) ON UPDATE CASCADE ON DELETE RESTRICT
    "interface_events_interface_id_fkey" FOREIGN KEY (interface_id) REFERENCES interface(id) ON UPDATE CASCADE ON DELETE RESTRICT

                                       Table "public.device"
        Column        |         Type          |                      Modifiers
----------------------+-----------------------+-----------------------------------------------------
 id                   | integer               | not null default nextval('device_id_seq'::regclass)
 interface_id         | integer               | not null

Indexes:
    "device_pkey" PRIMARY KEY, btree (id)
    "device_interface_id_idx" btree (interface_id)

Foreign-key constraints:
    "device_interface_id_fkey" FOREIGN KEY (interface_id) REFERENCES interface(id) ON UPDATE CASCADE ON DELETE RESTRICT

Column interface_id in table interface_events needs updating from table device, so currently I have:

UPDATE interface_events SET interface_id = device.interface_id 
FROM device 
WHERE device.id = interface_events.device_id

I should point out that the relationship between device and interface_events is one to many, so there may be thousands of events per device. Is the UPDATE FROM method suitable in this situation? I'd earlier tried:

UPDATE interface_events SET interface_id = (SELECT device.interface_id FROM device WHERE device.id = interface_events.device_id);

But that took forever also.

There are 30 million rows in table interface_events, 32,000 in table device.

The update currently running (using the FROM syntax) has been active for 21 hours, although this is on a low-spec test machine (2 x 3GHz cores, 3GB RAM, spinning disk); it's disk I/O that's killing it, at a constant 99%.

It occurs to me as I write this that I should have applied the foreign key constraint after the update, since I know its integrity will be fine thanks to the column I'm copying from already being constrained, so I guess this question is more about the best query to use?
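
That is, the plan would presumably be something like this untested sketch (the constraint name and definition are the ones shown in the table description above):

-- Drop the new foreign key before the bulk update.
ALTER TABLE interface_events DROP CONSTRAINT interface_events_interface_id_fkey;

UPDATE interface_events SET interface_id = device.interface_id
FROM device
WHERE device.id = interface_events.device_id;

-- Re-add it afterwards; integrity is guaranteed because device.interface_id
-- is itself already constrained against interface(id).
ALTER TABLE interface_events
    ADD CONSTRAINT interface_events_interface_id_fkey
    FOREIGN KEY (interface_id) REFERENCES interface(id)
    ON UPDATE CASCADE ON DELETE RESTRICT;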

whoasked
  • How many distinct interface_ids are there? Is it possible you are overwriting interface_id with the same value for many rows? In that case you might want to add AND interface_events.interface_id <> device.interface_id – a_horse_with_no_name Feb 20 '19 at 12:37
  • @a_horse_with_no_name Very good point, there are < 200 interface_ids. Still, which format, FROM or subquery? Or are they much the same under the hood? – whoasked Feb 20 '19 at 12:46
  • @a_horse_with_no_name actually sorry I should also point out interface_events.interface_id is a new column, added to allow the table to be queried at interface level without joining through device, so currently every row will be updated (from null). – whoasked Feb 20 '19 at 12:52
  • Please post EXPLAIN plans for each query. What is "work_mem" set to? – jjanes Feb 20 '19 at 13:02
  • Hmm, on my laptop (admittedly with an SSD) both variants run in about the same time (< 2 min). https://explain.depesz.com/s/XIBgO and https://explain.depesz.com/s/Yotp – a_horse_with_no_name Feb 20 '19 at 14:02
  • @jjanes https://explain.depesz.com/s/hvvE, https://explain.depesz.com/s/S7P6. These are against an earlier backup of the same DB, so slightly smaller data sets. – whoasked Feb 20 '19 at 14:17
  • If the hash table on devices can fit in memory (this is where work_mem is important) and so can be done in a single batch, this should be reasonably efficient. But not as efficient as just creating the table new. If it needs to do multiple batches, the performance can really suck, as the IO to the interface_events will no longer be sequential. – jjanes Feb 20 '19 at 14:22
  • @jjanes work_mem is not explicitly set in the conf file. On the live server shared_buffers is set to 4G, which was actually set when the spec was lower than it is now... On the test box this is also unset though. – whoasked Feb 20 '19 at 14:23
  • The default work_mem is 4MB. You can verify what it actually is on your running server using "show work_mem;". According to the plan shown by a_horse_with_no_name, that should be large enough to hold the hash table, but sometimes the calculation can be off, so maybe it would use multiple batches anyway. You can change work_mem in each session, so I would definitely bump it up in the session doing the update to at least 40MB (see the sketch after these comments). – jjanes Feb 20 '19 at 14:35
  • @a_horse_with_no_name maybe you could lend me your laptop ;-). I've run this again with interface_id as just an int (not fkey) and I've already had 100% disk for 20 mins. Currently with work_mem=32M, shared_buffers=2G. – whoasked Feb 20 '19 at 15:25
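
A sketch of the session-level change jjanes suggests in the comments above (40MB is just the suggested minimum, and only the session that runs the update is affected):

SHOW work_mem;           -- check the effective value (the default is 4MB)
SET work_mem = '40MB';   -- so the hash table on device can hopefully fit in a single batch
-- ...then run the UPDATE ... FROM statement from the question in this same session.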

2 Answers


The fastest way will probably be to create a new table based on a join of the existing tables, create indexes and foreign key constraints, and then drop the old "interface_events" and rename the new table into place. Views and foreign keys that reference this table can complicate this.

create table interface_events_temp as 
    select interface_events.*, device.interface_id from interface_events 
        join device on (device.id = interface_events.device_id)

But the interface_events.* notation relies on interface_id not already existing in interface_events. Since it already exists, you would have to list out all columns in interface_events except for interface_id.
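
One way the swap could look, reusing the foreign-key definitions from the question (indexes and any other constraints on the old table would need recreating here as well):

-- once interface_events_temp has been created and populated as above:
begin;
drop table interface_events;
alter table interface_events_temp rename to interface_events;

alter table interface_events
    add constraint interface_events_device_id_fkey
    foreign key (device_id) references device(id)
    on update cascade on delete restrict;
alter table interface_events
    add constraint interface_events_interface_id_fkey
    foreign key (interface_id) references interface(id)
    on update cascade on delete restrict;
commit;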

jjanes
  • This does actually kind of intuitively make sense, leveraging the join. Unfortunately the actual table contains quite a few columns and there are foreign keys to consider. I'm waiting on a like-for-like virtual machine for the live system to see how long it will actually take; if it's still too long I'll likely revert to this, thanks. – whoasked Feb 20 '19 at 14:21

For anyone arriving here, the resolution lay in the answers to this post.

Whilst I'm sure the answer provided by jjanes would probably be at least as quick in terms of the update itself, on this occasion I felt dropping and recreating the indexes was more straightforward.

With no indexes or constraints on the table, the update dropped from 21+ hours to 5 minutes. Quite a fall.
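
For completeness, the pattern was roughly the following (constraint names as in the question; the table's other indexes aren't shown there, so those steps are only indicated):

-- Strip everything that slows the bulk update down.
ALTER TABLE interface_events DROP CONSTRAINT interface_events_device_id_fkey;
ALTER TABLE interface_events DROP CONSTRAINT interface_events_interface_id_fkey;
-- DROP INDEX ... for each index on interface_events

-- Run the UPDATE ... FROM statement from the question (about 5 minutes here).

-- Recreate the indexes, then re-add the two foreign keys with their original definitions.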

whoasked