6

I have a basic question on how JOIN works on multiple tables. I want to count occurrences of Foreign Key in link1 & link2

CREATE TABLE main (
   id SERIAL PRIMARY KEY,
   name text NOT NULL
);

CREATE TABLE link1 (
   id SERIAL PRIMARY KEY,
   main_id integer NOT NULL,
   CONSTRAINT main_id_fk FOREIGN KEY (main_id) REFERENCES main (id)
);

-- link2 is similar to link1

SQL Fiddle

Why does the query below give a product of counts (rather than sum) when the count is non-zero in both columns.

SELECT main.id, COUNT(link1.main_id) + COUNT(link2.main_id)
FROM main
LEFT JOIN link1 ON main.id=link1.main_id
LEFT JOIN link2 ON main.id=link2.main_id
GROUP BY main.id
user4150760
  • 1,099
  • 3
  • 14
  • 20

2 Answers2

8

What you see is a "proxy cross join". Aggregate first, then join:

SELECT m.id, COALESCE(l1.ct, 0) + COALESCE(l2.ct, 0) AS total_ct
FROM   main m
LEFT   JOIN (
   SELECT main_id, count(*) AS ct
   FROM   link1
   GROUP  BY main_id
   ) l1 ON l1.main_id = m.id
LEFT   JOIN (
   SELECT main_id, count(*) AS ct
   FROM   link2
   GROUP  BY main_id
   ) l2 ON l2.main_id = m.id
ORDER BY m.id;

SQL Fiddle.

Do not multiply rows with multiple unqualified joins and count(DISTINCT ...) later to fix the mistake. It happens to work in this case since counting DISTINCT link1.id / link2.id coincides with the desired result, but it's needlessly expensive and error prone.

Detailed explanation and a couple of syntax variants in these related answers on SO:

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
3

I'll attempt to answer it myself. Consider a LEFT JOIN between main & link1. The output would be

main.id   link1.main_id
   1          1
   1          1
   2          2
   3         NULL  
   4         NULL  

Now do a LEFT JOIN of the above table with link2, output would be:

main.id    link1.main_id    link2.main_id
   1             1               NULL  
   1             1               NULL  
   2             2                2
   2             2                2     -- Error : double counting for link1
   3            NULL              3
   4            NULL                   

Now count the occurrences of main_id & sum them (grouped by main.id)

main.id        Count
   1             2
   2             2 + 2  
   3             1
   4             0

So two successive LEFT JOIN are happening sequentially rather than in parallel. The correct approach to get the count would be do conduct 2 queries separately and then add the results

Update Another way according to @a1ex07 is

SELECT main.id, COUNT(DISTINCT link1.id) + COUNT(DISTINCT link2.id)
FROM main
LEFT JOIN link1 ON main.id=link1.main_id
LEFT JOIN link2 ON main.id=link2.main_id
GROUP BY main.id
user4150760
  • 1,099
  • 3
  • 14
  • 20
  • You seem to confuse 0 with NULL . The first 2 outputs won't have any zeros. link1.main_id and link2.main_id might be either equals to main.id or NULL. And why not just COUNT(DISTINCT link1.id) + COUNT(DISTINCT link2.id) ? – a1ex07 Dec 30 '14 at 15:00
  • @a1ex07 Thanks for pointing out about NULL. I'll update the answer. Regarding DISTINCT, I want the count to reflect multiple entries in any table. – user4150760 Dec 30 '14 at 15:06
  • 1
    I believe distinct will do that (note, it's counting distinct link1.id and link2.id, not link1.main_id/ link2.main_id ). Thus, in your example for main.id = 2 there will be 1 distinct link1.id and 2 distinct link2.id... – a1ex07 Dec 30 '14 at 15:24
  • Right, updated the answer. Cool technique. – user4150760 Dec 30 '14 at 17:36
  • 2
    @a1ex07: While DISTINCT does the trick here, it's a bit like putting lipstick on a pig ... – Erwin Brandstetter Jan 03 '15 at 05:35
  • @ErwinBrandstetter : Nothing tricking about approach with distinct. It gives correct result, as well as other solutions like yours or mentioned in the links you provided. Performance-wise distinct might behave worse, but I'm not absolutely sure about it without seeing real tests with real data. – a1ex07 Jan 04 '15 at 16:09
  • @a1ex07: I have been running many performance tests over the past few years. For small tables it doesn't matter much - because nothing matters much with small tables (except for the worst atrocities). The (limited) CROSS JOIN between rows in link1 and link2 has a cost characteristic of O(N²), which can be catastrophic for big tables (with many combinations). The additional separate DISTINCT steps aren't cheap either. I'll look for a related answer with benchmarks ... – Erwin Brandstetter Jan 05 '15 at 03:39
  • Related answer with benchmark on SO: http://stackoverflow.com/questions/11350686/combine-results-of-joins-on-two-tables/11351593#11351593. Also consider the discussion in the comments under the second answer by Gordon: http://stackoverflow.com/a/11351033/939860 – Erwin Brandstetter Jan 05 '15 at 04:29
  • @ErwinBrandstetter: It seems to me that there is a big difference between this question and links in your comment . The latter doesn't give optimizer any choice but perform cross join and then filter out duplicates to satisfy distinct. In the former I'd expect optimizer to generate the same plan as for your query . – a1ex07 Jan 05 '15 at 16:11
  • @a1ex07: I suppose you run a benchmark then and see for yourself. It's the same basic principle. – Erwin Brandstetter Jan 05 '15 at 16:55
  • It's not the same for those 2 particular queries. I'm not arguing logical stages, but physical order can be completely different. – a1ex07 Jan 05 '15 at 19:07