DISTINCT ON vs ROW_NUMBER() for getting a row with a maximum value of a group

Question

I have the following table in PostgreSQL 9.4:

CREATE TABLE dpa(
  id serial NOT NULL,
  currency_id integer,
  amount numeric(14,3),
  date timestamp without time zone,
  plat_id integer,
  pl_id integer,
  p_id integer,
  CONSTRAINT dpa_pkey PRIMARY KEY (id), 
)

and settings:

work_mem = 128MB
table_size = 16 MB

And index:

CREATE INDEX idx1
  ON dpa
  USING btree
  (plat_id, p_id, pl_id, currency_id, date DESC NULLS LAST, amount)

The table consists of approximately 242K rows. I don't have NOT NULL constraints on the column, but they are actually NOT NULL.

Now, I'm measuring performance of the queries:

I

SELECT plat_id, p_id, pl_id, player_account player_account
FROM(
    SELECT plat_id, p_id, pl_id, 
    COALESCE(amount, 0) player_account,
    ROW_NUMBER() OVER (PARTITION BY plat_id, p_id, pl_id, currency_id
                       ORDER BY date DESC NULLS LAST) rn
    FROM dpa
) sub WHERE rn = 1;

Analyzed plan:

Subquery Scan on sub  (cost=0.42..25484.16 rows=1214 width=44) (actual time=0.044..296.810 rows=215274 loops=1)
  Filter: (sub.rn = 1)
  Rows Removed by Filter: 27556
  ->  WindowAgg  (cost=0.42..22448.79 rows=242830 width=28) (actual time=0.043..255.690 rows=242830 loops=1)
        ->  Index Only Scan using idx1 on dpa  (cost=0.42..16378.04 rows=242830 width=28) (actual time=0.037..91.576 rows=242830 loops=1)"
              Heap Fetches: 242830

II

SELECT DISTINCT ON(plat_id, p_id, pl_id, currency_id)
       plat_id, p_id, pl_id, currency_id, amount
FROM dpa
ORDER BY plat_id, p_id, pl_id, currency_id, date DESC NULLS LAST

Analyzed plan:

Unique  (cost=0.42..18794.73 rows=82273 width=28) (actual time=0.017..128.277 rows=215274 loops=1)
  ->  Index Only Scan using idx1 on dpa  (cost=0.42..16366.43 rows=242830 width=28) (actual time=0.016..72.110 rows=242830 loops=1)
        Heap Fetches: 242830

As can be seen, the second query is faster than the first one. But when I execute this queries in PGAdmin I got the following average statistics:

The query with ROW_NUMBER()(the first): 4999 ms

The query with DISTINCT ON(the second): 5654 ms

I understand that bandwith/latency overhead on such large result set is significant. All queries produce 215274 rows.

QUESTION: Why does it take more time to receive all rows in the second case than in the first one, although the planner shows that the second plan is more optimal?

Hm, the timing you see might be given by pgAdmin - that means it displays the time it needed to get and render the output. As you know the time needed for the backend to get the data, the difference you see must come from the transfer and/or rendering. You show less columns in the first query, that might be a reason, for example. — András Váczi, Oct 22 '15 at 15:56
I think dezso has a point. The remaining difference: COALESCE(amount, 0) in the first query - so pgAdmin does not have to render NULL values. BTW, if you are looking for best performance and there are many rows per group, a similar query like the one I gave you for your related question yesterday might be fastest - in combination with a matching index. For few rows per group, DISTINCT ON should be fastest: http://stackoverflow.com/a/7630564/939860 — Erwin Brandstetter, Oct 22 '15 at 16:19
would a LIMIT 1 in the inner query make more sense? Perhaps with a GROUP BY (rather than using a DISTINCT clause) — mpag, Oct 22 '15 at 16:44
@mpag: No. The inner query in query I needs to return one row per group, not just one row total. We could use LIMIT 1 in a LATERAL join, like I outlined in the answer I linked to in my previous comment. More details here: http://stackoverflow.com/a/25536748/939860 — Erwin Brandstetter, Oct 23 '15 at 01:30
@dezso It seems you're right. PGAdmin is not a good tool to measure the time spending for transfering. Couldn't you suggest a tool for this? Maybe there's some pg-extension or something like that... — St.Antario, Oct 23 '15 at 06:31
@dezso So, for instance, If I need to measure the timing for a web-server I will have to write some code explcitly in order for to track this? — St.Antario, Oct 23 '15 at 08:00

András Váczi · Accepted Answer · 2016-01-25T19:15:14.100

The timing you see is given by pgAdmin (but it could be any other client) - that means it displays the time it needed to get and render the output. As you know the time needed for the database to produce the data (by using EXPLAIN ANALYZE), the difference you see must come from the transfer and/or rendering. You show less columns in the first query, that might be a reason, for example.

If you want to have an idea about how much time is consumed during the transfer, you can time the execution of the query from an application. If you only fetch the data but don't process it in any way (not to mention render it), you get a good approximation on the time needed for transferring the data. Just take the time needed for fetching the data and subtract the (already known) execution time of the database.

This way you will have an idea about the rendering time of pgAdmin, too, by comparing the above numbers to the one it gives you.

Actually, rendering is the key thing here. – St.Antario Oct 23 '15 at 11:57 — St.Antario, Oct 23 '15 at 11:57

DISTINCT ON vs ROW_NUMBER() for getting a row with a maximum value of a group

I

II

1 Answers1

Linked