
I have this table in PostgreSQL 9.4:

CREATE TABLE user_operations (
    id SERIAL PRIMARY KEY,
    operation_id integer,
    user_id integer
);

The table contains ~1000-2000 different operations, each of which corresponds to some subset (of approximately 80000-120000 elements) of the set S of all users:

S = {1, 2, 3, ... , 122655}

Parameters:

work_mem = 128MB
table_size = 880MB

I also have an index on the operation_id.
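The plan below refers to this index as idx; it was presumably created along these lines (the name and exact form are assumptions, not shown in the question):

```sql
-- plain btree index on the filter column; "idx" matches the name in the plan
CREATE INDEX idx ON user_operations (operation_id);
```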

QUESTION: What would be the optimal plan for querying all distinct user_id values for a significant part (20%-60%) of the operation_id set, like:

SELECT DISTINCT user_id FROM user_operations WHERE operation_id < 500

It's possible to create more indexes on the table. Currently, the plan for the query is:

HashAggregate  (cost=196173.56..196347.14 rows=17358 width=4) (actual time=1227.408..1359.947 rows=598336 loops=1)
  ->  Bitmap Heap Scan on user_operation  (cost=46392.24..189978.17 rows=2478155 width=4) (actual time=233.163..611.182 rows=2518122 loops=1)
        Recheck Cond: (operation_id < 500)
        ->  Bitmap Index Scan on idx  (cost=0.00..45772.70 rows=2478155 width=0) (actual time=230.432..230.432 rows=2518122 loops=1)
              Index Cond: (operation_id < 500)

Is this query plan really optimal under these circumstances? I'm not sure the Bitmap Heap Scan is the right choice here. I'd appreciate any references to relevant articles.

St.Antario
  • If my question seems too broad, please let me know what kind of parameters I should add to make it more specific. – St.Antario Oct 21 '15 at 06:20
  • What's your problem with the bitmap heap scan? :) Anyway, the numbers you give above differ significantly from those in the plan. Why is that? Also, an EXPLAIN (ANALYZE, BUFFERS) could possibly tell a bit more about the optimality of your plan. What is your expectation (time-wise)? There are chances that an index on (operation_id, user_id) would result in a somewhat more efficient query. – András Váczi Oct 21 '15 at 08:41
  • @dezso I've tried that index. The planner won't even consider it. – St.Antario Oct 21 '15 at 08:47
  • That you don't know :) It just didn't choose it. OK, this is nitpicking :) Did you ANALYZE the table? Also, after SET enable_bitmapscan TO off;, does it show up in the plan? Is it worse or better? – András Váczi Oct 21 '15 at 09:00
  • @dezso Yes, I ran ANALYZE on the table after creating the index. Hmmm, maybe. But I really don't see how that index could help the query; we don't restrict it by user_id... – St.Antario Oct 21 '15 at 09:06
  • You don't restrict on it, but you do a DISTINCT on it. My idea is that a single index-only scan would be enough to collect the data for the aggregation, lowering the I/O needs. – András Váczi Oct 21 '15 at 09:22
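The covering index and the bitmap-scan toggle discussed in the comments above could be tried out along these lines (the index name is illustrative, not from the thread):

```sql
-- covering index so an index-only scan can serve both the filter and the DISTINCT
CREATE INDEX idx_op_user ON user_operations (operation_id, user_id);
ANALYZE user_operations;  -- refresh statistics and the visibility map info

-- temporarily rule out the bitmap scan to see the alternative plan
SET enable_bitmapscan TO off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT user_id FROM user_operations WHERE operation_id < 500;
RESET enable_bitmapscan;
```

Comparing the two EXPLAIN (ANALYZE, BUFFERS) outputs (with and without the toggle) shows whether the planner's choice of the bitmap scan is actually justified by cost and buffer counts.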

1 Answer


What would be the optimal plan for querying all distinct user_id values for a significant part (20%-60%) of the operation_id set?

Use a recursive query:

WITH RECURSIVE cte AS (
   (  -- parentheses are required
   SELECT user_id
   FROM   user_operations
   WHERE  operation_id < 500
   ORDER  BY user_id
   LIMIT  1
   )
   UNION ALL
   SELECT u.user_id
   FROM   cte, LATERAL (
      SELECT user_id
      FROM   user_operations
      WHERE  operation_id < 500
      AND    user_id > cte.user_id  -- lateral reference
      ORDER  BY user_id
      LIMIT  1
      ) u
   )
TABLE cte;

In combination with an index on (user_id, operation_id) - columns in that order. I expect index scans that filter on the second column. Reasonably accurate table statistics are important so Postgres knows it will only have to skip a few rows in the index to find the next user_id. Generally, one might want to increase the statistics target for operation_id in particular:

ALTER TABLE user_operations ALTER operation_id SET STATISTICS 1000;

Since there are only ~1000-2000 different operations, this may not even be necessary, but it's a small price to pay.
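The multicolumn index the recursive query relies on could be created like this (the index name is illustrative):

```sql
-- columns in this order: user_id leads, so the query can hop from one
-- distinct user_id to the next; operation_id allows index-only scans
CREATE INDEX idx_user_op ON user_operations (user_id, operation_id);
ANALYZE user_operations;  -- make sure the planner sees fresh statistics
```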

Details:

If the predicate operation_id < 500 is stable (always the same), make that a partial index on just (user_id) instead:

CREATE INDEX foo ON user_operations (user_id) WHERE operation_id < 500;

Then statistics on operation_id are not relevant to this query any more.

Even if the predicate is not stable, there may be ways to optimize - depending on the full range of possible conditions and value frequencies.

Performance should be ... delicious.

I optimized the technique in this related answer on SO (with detailed explanation):

If you have a separate users table and a large part of all users can be found in your sample, even faster query styles are possible. Details in the linked answer.

Erwin Brandstetter
  • May I ask a question related to the algorithm you proposed? Since btree index traversal has approximately constant time complexity (the number of tree levels grows very slowly), can we say that the "complexity of the query" is approximately O(N), where N is the number of distinct user_ids? In fact, it starts slowly, but works very well if the many-to-many table is very large (much larger than the count of distinct user_ids). – St.Antario Oct 23 '15 at 06:28
  • @St.Antario: Yes, approximately. In addition to the points you already mentioned, total number of rows and total table size are also relevant (to a lesser extent). If you exhaust RAM limits, performance takes a hit. If index-only scans are possible (like probably in this case), only the size of the index matters. – Erwin Brandstetter Oct 23 '15 at 08:56