
I need to select N rows with seq > X, keeping every row whose reduce_key is NULL, and for each group of rows sharing the same reduce_key keeping only the one with the biggest seq. The result should be strictly sequential, i.e. ordered by seq (with rows sharing a reduce_key filtered out except the one with the biggest seq), and limited to N.
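
For example (illustrative values only, not real data): given rows with (seq, reduce_key) = (10, NULL), (11, 'a'), (12, 'a'), (13, NULL), (14, 'b'), I expect (10, NULL), (12, 'a'), (13, NULL), (14, 'b') back — both NULL rows kept, and for 'a' only the row with the biggest seq.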

Here is a query which does that:

explain analyze SELECT
    x2."user_id",
    x2."seq",
    x2."timestamp",
    x2."reduce_key",
    x2."mapping"
FROM
    "user_sequence" x2
WHERE
    (
        (x2."user_id" = 860862404)
        AND (x2."seq" > 3562974)
    )
AND (
    (x2."reduce_key" IS NULL)
    OR (
        x2."seq" IN (
            SELECT
                MAX ("seq")
            FROM
                "user_sequence"
            WHERE
                ("user_id" = 860862404)
            AND (
                ("seq" >= x2."seq")
                AND ("reduce_key" = x2."reduce_key")
            )
            GROUP BY
                "reduce_key" 
        )
    )
)
ORDER BY
    x2."seq"
LIMIT 1000

The problem is that it takes ~70 ms to execute on a table with tens of millions of rows. Is there a more efficient way to reduce rows by key while keeping the rows with a NULL key?

Table structure:

actor=# \d user_sequence;
   Table "public.user_sequence"
   Column   |  Type   | Modifiers 
------------+---------+-----------
 user_id    | integer | not null
 seq        | integer | not null
 timestamp  | bigint  | not null
 mapping    | bytea   | not null
 reduce_key | text    | 
Indexes:
    "user_sequence_pkey" PRIMARY KEY, btree (user_id, seq)
    "user_sequence_user_id_reduce_key_seq_idx" btree (user_id, reduce_key, seq)

EXPLAIN ANALYZE:

 Limit  (cost=0.57..231389.06 rows=1000 width=103) (actual time=0.408..74.754 rows=308 loops=1)
   ->  Index Scan using user_sequence_pkey on user_sequence x2  (cost=0.57..24444343.74 rows=105642 width=103) (actual time=0.406..74.645 rows=308 loops=1)
         Index Cond: ((user_id = 860862404) AND (seq > 3562974))
         Filter: ((reduce_key IS NULL) OR (SubPlan 1))
         Rows Removed by Filter: 692
         SubPlan 1
           ->  GroupAggregate  (cost=0.57..447.59 rows=1 width=33) (actual time=0.096..0.097 rows=1 loops=730)
                 Group Key: user_sequence.reduce_key
                 ->  Index Only Scan using user_sequence_user_id_reduce_key_seq_idx on user_sequence  (cost=0.57..447.03 rows=110 width=33) (actual time=0.036..0.081 rows=68 loops=730)
                       Index Cond: ((user_id = 860862404) AND (reduce_key = x2.reduce_key) AND (seq >= x2.seq))
                       Heap Fetches: 49371
 Planning time: 0.808 ms
 Execution time: 74.890 ms
(13 rows)

I think the problem is with the filter, but how can I make it use an index?

I am using PostgreSQL 9.4.

  • Your description is not clear and seems to disagree with the query you display. Can you clarify your description? Or are you sure the query works as intended (just not fast enough)? And please always provide your version of Postgres. Please also provide some essential statistics: how many distinct user_id, how many distinct reduce_key per user_id (min/max/avg)? How many rows per (user_id, reduce_key) and how many in total? How many rows with reduce_key IS NULL? The best query depends on data distribution. – Erwin Brandstetter Jan 30 '16 at 14:18

1 Answer


Assuming you want what your current query does (which seems different from what your description says).

You ask:

I think that problem is with filter but how to make it use index?

Your query plan shows that Postgres is already using indexes for every step of the way. The added FILTER step only filters rows according to your additional predicates.

I posted a different idea at first that was missing ORDER BY seq. This query using a simple NOT EXISTS anti-semi-join should be a better equivalent:

SELECT user_id
     , seq
     , timestamp
     , reduce_key
     , mapping
FROM   user_sequence u
WHERE  user_id = 860862404
AND    seq > 3562974
AND    NOT EXISTS (
   SELECT 1
   FROM   user_sequence
   WHERE  user_id = u.user_id
   AND    reduce_key = u.reduce_key
   AND    seq > u.seq
   )
ORDER  BY seq
LIMIT  1000;

Rows with reduce_key IS NULL pass the NOT EXISTS test automatically, because reduce_key = u.reduce_key can never be true when u.reduce_key is NULL. Your existing indexes look good for it. The second one might be a bit more efficient with descending sort order for the last item:

user_sequence_user_id_reduce_key_seq_idx btree (user_id, reduce_key, seq DESC)
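
A minimal sketch of the corresponding DDL, assuming you replace the existing index and reuse its name (you could just as well create it under a new name):

DROP INDEX user_sequence_user_id_reduce_key_seq_idx;
CREATE INDEX user_sequence_user_id_reduce_key_seq_idx
    ON user_sequence (user_id, reduce_key, seq DESC);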

Normally, there would be a simpler alternative with DISTINCT ON, but that would fold all rows with reduce_key IS NULL into a single row while you want them all.
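
For illustration, a sketch of what that DISTINCT ON variant would look like (it would also need an outer query to restore ORDER BY seq, and it keeps only a single row for reduce_key IS NULL, which is why it does not fit here):

SELECT DISTINCT ON (reduce_key)
       user_id, seq, timestamp, reduce_key, mapping
FROM   user_sequence
WHERE  user_id = 860862404
AND    seq > 3562974
ORDER  BY reduce_key, seq DESC;  -- one row per reduce_key; all NULLs folded into one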

Related answer with more explanation:

Aside: You have a lot of unnecessary parentheses and double-quotes. Maybe an inefficient ORM?

Erwin Brandstetter
  • "You have a lot of unnecessary parentheses and double-quotes. Maybe an inefficient ORM?"

    Yes, I am using https://slick.typesafe.com. I beautified the query before posting.

    Thanks for such a comprehensive answer! But I have one doubt: will your query skip a row if there is another one with the same reduceKey that is not included in the result because of LIMIT 1000? I suppose it will. Moreover, my version may have that problem too...

    – Andrey Kuznetsov Feb 18 '16 at 13:17
  • @AndreyKuznetsov: The LIMIT clause is applied last in this query, after the NOT EXISTS filter has been evaluated against the whole table. So, yes, only the row with the greatest seq per (user_id, reduce_key) will be returned. (Except for the special case reduce_key IS NULL.) – Erwin Brandstetter Feb 18 '16 at 13:49