Sort by number of related rows in referencing table

Question

Let there be two tables:

Users

id [pk] |   name
--------+---------
      1 | Alice
      2 | Bob
      3 | Charlie
      4 | Dan

Emails

 id | user_id | email 
----+---------+-------
  1 |       1 | a.1
  2 |       1 | a.2
  3 |       2 | a.3
  4 |       2 | b.1
  5 |       2 | a.4
  6 |       2 | a.5
  7 |       3 | b.2
  8 |       3 | a.6

With a single query I want to retrieve:

user's id and name
count of user's emails
user's email and its id

I'd like the output to be ordered descending by number of emails and filtered including only emails starting with 'a'. Users without emails shall be included, too - treat their emails' count as 0.

There is my query:

SELECT users.id AS user_id, users.name AS name,
       emails.id AS email_id, emails.email AS email,
       count(emails.id) OVER (PARTITION BY users.id) as n_emails
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY n_emails DESC;

And the (expected) result, it looks good:

 user_id |  name   | email_id | email | n_emails 
---------+---------+----------+-------+----------
       2 | Bob     |        6 | a.5   |        3
       2 | Bob     |        5 | a.4   |        3
       2 | Bob     |        3 | a.3   |        3
       1 | Alice   |        2 | a.2   |        2
       1 | Alice   |        1 | a.1   |        2
       3 | Charlie |        8 | a.6   |        1

It's obvious that this is a simple and small example while the actual dataset could be large enough, so I'd like to use LIMIT/OFFSET for paging. For example, I'd like to fetch a first pair of users (not just rows):

-- previous query ...
LIMIT 2 OFFSET 0;

And… fail. I've got incomplete information about Bob only:

 user_id | name | email_id | email | n_emails 
---------+------+----------+-------+----------
       2 | Bob  |        6 | a.5   |        3
       2 | Bob  |        5 | a.4   |        3

Hence the question: how can I apply limit/offset to objects, in this case, users (logical entities, not rows)?

I've found such solution: add dense_rank() over users.id and then filter by rank:

SELECT * FROM (
    SELECT users.id AS user_id, users.name AS name,
           emails.id AS email_id, emails.email AS email,
           count(emails.id) OVER (PARTITION BY users.id) as n_emails,
           dense_rank() OVER (ORDER BY users.id) as n_user
    FROM users
    LEFT JOIN emails on users.id = emails.user_id
    WHERE emails.email LIKE 'a' || '%%'
    ORDER BY n_emails DESC
    ) AS sq
WHERE sq.n_user <= 2; -- here it is

The output looks good:

 user_id | name  | email_id | email | n_emails | n_user 
---------+-------+----------+-------+----------+--------
       2 | Bob   |        6 | a.5   |        3 |      2
       2 | Bob   |        5 | a.4   |        3 |      2
       2 | Bob   |        3 | a.3   |        3 |      2
       1 | Alice |        2 | a.2   |        2 |      1
       1 | Alice |        1 | a.1   |        2 |      1

But if you look at query plan, you'll see that the most expensive steps are subquery scan and sorting. AFAIK it is impossible to build index on subquery or CTE, so it will be always sequence scan/filter over n_user and query will execute for a long time on big dataset.

Another solution I see to make two queries:

retrieve only user ids and number of emails for filtered and sorted dataset using subquery;
join first subquery with users and emails

The query is:

SELECT users.id AS user_id, users.name,
       emails.id AS email_id, emails.email,
       sq.n_emails
FROM
(SELECT users.id, count(emails.id) AS n_emails
    FROM users
    LEFT JOIN emails ON users.id = emails.user_id
    WHERE emails.email LIKE 'a' || '%%'
    GROUP BY users.id
    ORDER BY n_emails DESC
    LIMIT 2 OFFSET 0 -- here it is
    ) AS sq
JOIN users ON users.id = sq.id
LEFT JOIN emails ON emails.user_id = users.id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY sq.n_emails DESC;

This seems to be much faster. But it doesn't look like good solution because I have to duplicate the exactly same query (except SELECT...FROM part), in fact, one query runs two times. Is there any better solution?

score 2 · Accepted Answer · edited May 23 '17 at 12:40

Exclude users without emails

Assuming we only want users that actually have emails. Users without emails are ignored. The reason I went with this assumption at first is that all your queries do that already:

LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'

By adding a WHERE condition on emails.email you effectively convert your LEFT JOIN to a plain [INNER] JOIN and exclude users without emails. Detailed explanation:

Query with LEFT JOIN not returning rows for count of 0

2nd query rewritten

Your 2nd query does not work as advertised, results are not "descending by number of emails". You have to nest the result of count() in another CTE or subquery and run dense_rank() on it. You cannot nest window functions in the same query level.

SELECT u.name, e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails          
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3
ORDER  BY rnk;

This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions with rows sorted differently have their price, too.

A major point is to run the subquery on emails only - which is possible if the preliminary assumption holds.

3rd query improved

If, on the other hand, the predicate WHERE e.email LIKE 'a' || '%' is not very selective, your 3rd query is probably faster, even if it reads from the table twice - but the second time only desired rows. Also improved:

SELECT e.user_id, u.name,
       e.id AS e_id, e.e_mail, sq.n_emails
FROM  (
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a' || '%'
   GROUP  BY user_id
   ORDER  BY count(*) DESC, user_id  -- break ties
   LIMIT  2  OFFSET 0
   ) sq
JOIN   emails e USING (user_id)
JOIN   users  u ON u.id = e.user_id
WHERE  e.email LIKE 'a' || '%'
ORDER  BY sq.n_emails DESC;

Include users without emails

You can either include the users table in the inner query again, similar to what you had before. But you have to pull the filter on email into the join condition!

SELECT u.name, e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
   FROM  (
      SELECT u.id AS user_id, u.name, e.id AS e_id
           , count(e.user_id) OVER (PARTITION BY u.id) AS n_emails          
      FROM   users u
      LEFT   JOIN emails e ON e.user_id = u.id
                          AND e.email LIKE 'a' || '%'  -- !!!
      ) e1
   ) e2
WHERE  rnk < 3
ORDER  BY rnk;

Which will be a bit more expensive.

Since you retrieve users with the most emails first, users without emails will rarely be in the result. To optimize performance, you could use a UNION ALL with LIMIT:

(  -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails          
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3      -- adapt to paging!
ORDER  BY rnk
)
UNION ALL
(    
SELECT u.name, u.user_id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails  
FROM   users       u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a' || '%'
WHERE  e.e.user_id IS NULL
)
OFFSET 0      -- adapt to paging!
LIMIT  2      -- adapt to paging!

Detailed explanation:

Optimize a query on two big tables

`MATERIALIZED VIEW`

I would consider materializing the result for two reasons:

Subsequent queries are much faster.
You don't have to operate on a moving target. You speak of paging, and if users get new emails between pages, your whole sort order may be moot.

Build a MV from the 2nd query without LIMIT (REFRESH MATERIALIZED VIEW), then return the first page etc. It's a matter of policy, when you refresh the MV again.

you said that "Your 2nd query does not work as advertised, results are not "descending by number of emails", but I ran this query in psql and got the desired results, ordered by number of emails descending. The puprose I've used dense_rank() is to numerate "users" entity, numerate the groups of rows with equal users.id, to apply limit/offset by filtering through dense_rank(). It works as desired in each case I've tested. The issue is in the speed of query. — Alex Sidorov, Dec 15 '14 at 12:42
@AlexSidorov: Concerning order of rows in your 2nd query: If you got desired results, then that must be an odd coincidence. The expression dense_rank() OVER (ORDER BY users.id) obviously doesn't sort "descending by number of emails", but just by the numerical (and typically meaningless) value of users.id. — Erwin Brandstetter, Dec 15 '14 at 13:20
@AlexSidorov: I have added solutions including users without email, including a variant to optimize performance, since that seems like your main objective. Also note the added explanation at the top: Why your queries so far all excluded users without emails. — Erwin Brandstetter, Dec 15 '14 at 13:29
I'm using ORDER BY n_emails DESC in my 2nd query, but I've noticed you're using ORDER BY rnk. I think this is the misunderstanding. — Alex Sidorov, Dec 16 '14 at 11:13
@AlexSidorov: The ORDER BY in your subquery is not effective while you apply WHERE sq.n_user <= 2 in the outer query. This selects users with the lowest id, not the ones with the most emails. I apply ORDER BY rnk to the outer query, which does not affect the selection of rows, only their final sort order. — Erwin Brandstetter, Dec 16 '14 at 11:23