
I have this query to aggregate some data from t2 into t1. This is done to optimize the application I am working on so that there are fewer queries to the database. I chose the approach below to make sure that I don't have to update t1 twice.

The big question is, which indexes am I likely missing here, and can the query be optimized further?

update t1
set
  col1 = t2.col1_count,
  col2 = t2.col2_sum,
  col3 = t2.col3_sum
from  (
  select
    b.user_id, b.t1_id,
    coalesce(count(b.id), 0) as col1_count,
    sum(case when b.col5 = true then b.col2 else 0 end) as col2_sum,
    sum(case when b.col5 = false then b.col3 else 0 end) as col3_sum
  from t1 a 
    left join t2 b on b.t1_id = a.id
  where
    b.user_id = 1
  group by b.user_id, b.t1_id
) as t2
where 
  t2.t1_id = t1.id;

EDIT Adding requested information

These are my current indexes:

create index ix_t1_user_id on t1(user_id);
create unique index ux_t2_t1_id_t3_id on t2(t1_id, t3_id);
create index ix_t2_user_id on t2(user_id);
create index ix_t2_t1_id on t2(t1_id);

explain analyze gives me the following result:

Update on t1  (cost=2725.40..2737.42 rows=1 width=138) (actual time=1.428..1.428 rows=0 loops=1)
  ->  Nested Loop  (cost=2725.40..2737.42 rows=1 width=138) (actual time=0.646..1.148 rows=166 loops=1)
        ->  Subquery Scan on t2  (cost=2725.40..2725.42 rows=1 width=84) (actual time=0.642..0.729 rows=166 loops=1)
              ->  HashAggregate  (cost=2725.40..2725.41 rows=1 width=17) (actual time=0.639..0.685 rows=166 loops=1)
                    ->  Nested Loop  (cost=5.81..2725.39 rows=1 width=17) (actual time=0.034..0.536 rows=197 loops=1)
                          ->  Bitmap Heap Scan on t2 b  (cost=5.81..414.29 rows=193 width=13) (actual time=0.024..0.050 rows=197 loops=1)
                                Recheck Cond: (user_id = 1)
                                ->  Bitmap Index Scan on ix_t2_user_id  (cost=0.00..5.76 rows=193 width=0) (actual time=0.017..0.017 rows=197 loops=1)
                                      Index Cond: (user_id = 1)
                          ->  Index Scan using t1_pkey on t1 a  (cost=0.00..11.96 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=197)
                                Index Cond: (id = b.t1_id)
                                Filter: (user_id = 1)
        ->  Index Scan using t1_pkey on t1  (cost=0.00..11.98 rows=1 width=58) (actual time=0.002..0.002 rows=1 loops=166)
              Index Cond: (id = t2.t1_id)
Total runtime: 1.490 ms
mhenrixon
  • Added requested information – mhenrixon Jan 27 '13 at 13:35
  • Have you ANALYZEd your tables recently? The estimated and real row counts differ significantly. And another observation: the topmost row indicates that no rows were affected (actual rows=0). What does the plan look like when it really does something? What would be the desirable execution time? – András Váczi Jan 27 '13 at 18:09

1 Answer


Simplify query

  • Remove the useless user_id from the subquery.
  • Remove coalesce around count(). I quote the manual on aggregate functions:

    It should be noted that except for count, these functions return a null value when no rows are selected.

    Meaning, count() never returns NULL.

  • Remove the redundant LEFT JOIN from the subquery (Update: not if you want to set columns to 0 where no rows are found in t2).
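A quick way to see the difference for yourself (assuming the table names from your question): count() over zero qualifying rows yields 0, while other aggregates yield NULL.

```sql
SELECT count(*)  AS c   -- 0, never NULL
      ,sum(col2) AS s   -- NULL when no rows qualify
FROM   t2
WHERE  false;
```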

UPDATE t1
SET    col1 = t2.col1_count
      ,col2 = t2.col2_sum
      ,col3 = t2.col3_sum
FROM  (
   SELECT t1_id
         ,count(*) AS col1_count  -- if id is NOT NULL, count(*) is a bit faster
         ,sum(CASE WHEN col5 = true  THEN col2 ELSE 0 END) AS col2_sum -- might be simpler
         ,sum(CASE WHEN col5 = false THEN col3 ELSE 0 END) AS col3_sum -- missing info
   FROM   t2
   WHERE  user_id = 1
   GROUP  BY t1_id
   ) t2
WHERE  t2.t1_id = t1.id;

To reset rows in t1 without any matches in t2:

UPDATE t1
SET    col1 = 0, col2 = 0, col3 = 0
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.t1_id = t1.id);

To do both in one statement, a version that keeps the LEFT JOIN in the subquery, like your original, is probably faster; it depends on your data distribution.
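Such a combined version could look like the sketch below. It keeps the LEFT JOIN and wraps the aggregates in COALESCE so unmatched rows are reset to 0. The filter a.user_id = 1 assumes t1 also carries user_id, as the Filter: (user_id = 1) line in your plan suggests; adjust if that is not the case.

```sql
UPDATE t1
SET    col1 = COALESCE(t2.col1_count, 0)
      ,col2 = COALESCE(t2.col2_sum, 0)
      ,col3 = COALESCE(t2.col3_sum, 0)
FROM   t1 a
LEFT   JOIN (
   SELECT t1_id
         ,count(*) AS col1_count
         ,sum(CASE WHEN col5 = true  THEN col2 ELSE 0 END) AS col2_sum
         ,sum(CASE WHEN col5 = false THEN col3 ELSE 0 END) AS col3_sum
   FROM   t2
   WHERE  user_id = 1
   GROUP  BY t1_id
   ) t2 ON t2.t1_id = a.id
WHERE  a.id = t1.id
AND    a.user_id = 1;  -- restrict the reset to this user's rows
```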

Avoid empty updates

If there is a chance that values in t1 are already up to date, add conditions to the WHERE clause that prevent empty updates (applies to both queries):

...
AND (col1 IS DISTINCT FROM t2.col1_count OR -- again: might be simpler
     col2 IS DISTINCT FROM t2.col2_sum   OR -- missing info
     col3 IS DISTINCT FROM t2.col3_sum)

For columns defined NOT NULL you can use <> instead of IS DISTINCT FROM.

This can make a big difference, since updates are expensive.
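Applied to the simplified query from above, the complete statement with the guard would look like this (same caveats as before: col5 handling might be simpler, depending on the missing table definition):

```sql
UPDATE t1
SET    col1 = t2.col1_count
      ,col2 = t2.col2_sum
      ,col3 = t2.col3_sum
FROM  (
   SELECT t1_id
         ,count(*) AS col1_count
         ,sum(CASE WHEN col5 = true  THEN col2 ELSE 0 END) AS col2_sum
         ,sum(CASE WHEN col5 = false THEN col3 ELSE 0 END) AS col3_sum
   FROM   t2
   WHERE  user_id = 1
   GROUP  BY t1_id
   ) t2
WHERE  t2.t1_id = t1.id
AND   (t1.col1 IS DISTINCT FROM t2.col1_count OR  -- skip rows already up to date
       t1.col2 IS DISTINCT FROM t2.col2_sum   OR
       t1.col3 IS DISTINCT FROM t2.col3_sum);
```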

Index

The only index you need for this (in addition to the primary key on t1.id) is:

CREATE INDEX ix_t2_user_id ON t2(user_id);
Erwin Brandstetter
  • Might have been more specific if you had provided the table definition with your question as you should. – Erwin Brandstetter Jan 27 '13 at 20:09
  • Unfortunately if I remove the empty update I also remove the possibility to reset the stats for the rows that don't have a match and I need to update the same table twice. – mhenrixon Jan 27 '13 at 23:05
  • @mhenrixon: Right, I overlooked that special case. Concerns the LEFT JOIN I removed rather than the paragraph about "empty updates". I added an alternative. – Erwin Brandstetter Jan 27 '13 at 23:51
  • Getting rid of the left outer join makes minimal difference to the total runtime: Total runtime: 1.237 ms, but the second update will destroy our servers during peak hours: Total runtime: 105.531 ms. However, adding the IS DISTINCT FROM clause shaves about 30% off the total runtime for the original query: Total runtime: 0.995 ms, and a stunning 80% off the update with NOT EXISTS – mhenrixon Jan 28 '13 at 12:38
  • Thank you for those great suggestions, learnt a lot from your answer. – mhenrixon Jan 28 '13 at 12:39