2
update line_items set product_id = dups.latest_product_id
    from  line_items as li
    join products p on li.product_id = p.id
join vendors v on p.vendor_id = v.id
    join vendorgroups vg on v.vendorgroup_id = vg.id
    join duplicate_product_sets dups
      on dups.host_id = vg.host_id and dups.vendor_id = v.id
     and dups.invtid = p.invtid and dups.latest_product_id <> p.id

I have dropped indexes on line_items (except the primary key) because I figured since it is the table being updated, indexes would slow down the update.

Other than that, I have indexes on these columns (in addition to primary keys):

products.invtid
products.vendor_id
vendors.vendorroup_id
vendorgroups.host_id
duplicate_product_sets.host_id
duplicate_product_sets.vendor_id
duplicate_product_sets.invtid
duplicate_product_sets.latest_product_id

The result of EXPLAIN:

-------------------------------------------------------------------------------------------------------------------------------
 Update on line_items  (cost=5444.25..5004105.00 rows=397508809 width=753)
   ->  Nested Loop  (cost=5444.25..5004105.00 rows=397508809 width=753)
         ->  Seq Scan on line_items  (cost=0.00..12944.60 rows=434160 width=719)
         ->  Materialize  (cost=5444.25..20030.69 rows=916 width=34)
               ->  Hash Join  (cost=5444.25..20026.11 rows=916 width=34)
                     Hash Cond: (li.product_id = p.id)
                     ->  Seq Scan on line_items li  (cost=0.00..12944.60 rows=434160 width=10)
                     ->  Hash  (cost=5433.62..5433.62 rows=851 width=32)
                           ->  Nested Loop  (cost=147.90..5433.62 rows=851 width=32)
                                 Join Filter: (v.id = p.vendor_id)
                                 ->  Hash Join  (cost=147.47..1253.41 rows=700 width=40)
                                       Hash Cond: ((dups.vendor_id = v.id) AND (dups.host_id = vg.host_id))
                                       ->  Seq Scan on duplicate_product_sets dups  (cost=0.00..706.97 rows=39197 width=28)
                                       ->  Hash  (cost=117.37..117.37 rows=2007 width=20)
                                             ->  Hash Join  (cost=36.70..117.37 rows=2007 width=20)
                                                   Hash Cond: (v.vendorgroup_id = vg.id)
                                                   ->  Seq Scan on vendors v  (cost=0.00..53.07 rows=2007 width=14)
                                                   ->  Hash  (cost=24.09..24.09 rows=1009 width=14)
                                                         ->  Seq Scan on vendorgroups vg  (cost=0.00..24.09 rows=1009 width=14)
                                 ->  Index Scan using products_by_invtid on products p  (cost=0.42..5.96 rows=1 width=24)
                                       Index Cond: ((invtid)::text = (dups.invtid)::text)
                                       Filter: ((dups.latest_product_id <> id) AND (dups.vendor_id = vendor_id))
(22 rows)

I am using Postgres 9.4.

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
septerr
  • 183
  • 1
  • 8

1 Answers1

1

Query

You UPDATE statement looks good except for one major problem. Fixed and re-formatted with some other minor improvements:

UPDATE line_items li
SET    product_id = d.latest_product_id
FROM   products               p
JOIN   vendors                v  ON v.id  = p.vendor_id
JOIN   vendorgroups           vg ON vg.id = v.vendorgroup_id
JOIN   duplicate_product_sets d  ON d.invtid    = p.invtid
                                AND d.vendor_id = p.vendor_id
                                AND d.host_id   = vg.host_id
                                AND d.latest_product_id <> p.id
WHERE  p.id = li.product_id;
  • You don't need to join line_items a second time. And you also forgot to join the two instances, producing a cross join (as pointed out by @a_horse in the comments). I removed the redundant table and now the join to products does the job.

Indices

I have dropped indexes on line_items (except the primary key index) ...

Basically correct, but If you only update a small fraction of rows, you should still have an index on product_id, even if that incurs a small cost for index update. The gain should be much more substantial.

For best performance, these are the perfect indices:

li(product_id)
p(vendor_id, invtid, id)
v(id, vendorgroup_id)
vg(id, host_id)
d(vendor_id, invtid, host_id, latest_product_id)

I mean multicolumn indices where I listed multiple columns. Indices on individual columns also go a long way, but not all the way. Multicolumn indices are most efficient when index-only scans are possible. In the best case scenario, this update would only need to scan the suggested indices without touching any table except line_items.

Whether it pays to optimize this operation depends on the complete picture. Additional indices incur a small cost for write performance.

And be sure, that you use the same (adequate) data type for invtid in both columns. I see a cast to text in the EXPLAIN output that shouldn't be there.

Related:

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
  • You are a genius at this! The query now completes in few seconds. I will read up the links you included. I really want to understand the correct indexes to add, like you do! Thank you!! – septerr Apr 18 '15 at 22:19
  • 2
    @septerr: the main reason the query is faster is the removal of the implicit cross join you created between line_item and itself through the from line_items part. Read the note in the manual regarding this: http://www.postgresql.org/docs/current/static/sql-update.html#AEN85302 –  Apr 21 '15 at 20:32
  • @a_horse_with_no_name: Good point. I fixed that in passing without noting how deep the rabbit hole went .. – Erwin Brandstetter Apr 21 '15 at 20:37
  • @a_horse_with_no_name thank you. That makes sense. For some reason it was hard for me to think of updating a table with joins without having the primary table in the from clause. It was a revelation that that was unnecessary and moreover affecting performance. – septerr Apr 23 '15 at 18:23