
I am writing a query to calculate hiscore rankings for a game. It performs a GROUP BY to aggregate each player's hiscore statistics, which are stored across 23 different rows (one per "skill").

The query I came up with below works:

SELECT * FROM (
    SELECT
        username,
        SUM(level) total_level,
        SUM(experience) total_experience,
        (SELECT rank FROM overall_hiscore_rankings WHERE username = skill_hiscore.username) rank
    FROM skill_hiscore
    GROUP BY username
    ORDER BY rank
) a WHERE rank >= 1 AND rank <= 25;

Note: overall_hiscore_rankings is a materialized view, and the filter at the end is my implementation of the seek method for pagination, to avoid expensive LIMIT/OFFSET pagination.
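For contrast, a sketch of the two pagination styles for the next page (ranks 26-50); these are fragments meant to replace the tail of the query above:

```sql
-- LIMIT/OFFSET pagination (what the seek method avoids): page 2 still
-- computes and sorts all ranked rows, then discards the first 25.
ORDER BY rank LIMIT 25 OFFSET 25

-- Seek method: page 2 just shifts the rank bounds, so only the wanted
-- range is ever touched.
WHERE rank >= 26 AND rank <= 50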

I did not notice any problems with this until seeding about 30,000 players' worth of data into my hiscores schema. Because each player has 23 rows as mentioned above, that leaves a table with 690,000 rows.

This now gives atrocious performance, literally taking my database 100 seconds to execute this query:

Sort  (cost=35926899.17..35926907.50 rows=3330 width=35) (actual time=98599.592..98599.595 rows=25 loops=1)
  Sort Key: ((SubPlan 1))
  Sort Method: quicksort  Memory: 26kB
  ->  GroupAggregate  (cost=0.42..35926704.35 rows=3330 width=35) (actual time=149.023..98599.514 rows=25 loops=1)
        Group Key: skill_hiscore.username
        Filter: (((SubPlan 2) >= 1) AND ((SubPlan 3) <= 25))
        Rows Removed by Filter: 29978
        ->  Index Scan using skill_hiscore_username on skill_hiscore  (cost=0.42..46884.88 rows=690063 width=19) (actual time=144.644..293.859 rows=690063 loops=1)
        SubPlan 1
          ->  Seq Scan on overall_hiscore_rankings  (cost=0.00..567.04 rows=1 width=8) (actual time=0.002..1.592 rows=1 loops=25)
                Filter: ((username)::text = (skill_hiscore.username)::text)
                Rows Removed by Filter: 30002
        SubPlan 2
          ->  Seq Scan on overall_hiscore_rankings overall_hiscore_rankings_1  (cost=0.00..567.04 rows=1 width=8) (actual time=0.817..1.633 rows=1 loops=30003)
                Filter: ((username)::text = (skill_hiscore.username)::text)
                Rows Removed by Filter: 30002
        SubPlan 3
          ->  Seq Scan on overall_hiscore_rankings overall_hiscore_rankings_2  (cost=0.00..567.04 rows=1 width=8) (actual time=0.816..1.633 rows=1 loops=30003)
                Filter: ((username)::text = (skill_hiscore.username)::text)
                Rows Removed by Filter: 30002
Planning Time: 2.820 ms
JIT:
  Functions: 21
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.248 ms, Inlining 44.936 ms, Optimization 60.995 ms, Emission 38.520 ms, Total 145.700 ms
Execution Time: 98623.550 ms

Clearly, the ORDER BY and GROUP BY are causing the largest performance drain, but I am not sure how to rewrite this.

Nothing pops out at me that would explain a 100-second runtime - I have never seen a query perform so poorly, and frankly, optimizing this is beyond my knowledge.

This query is not backed by any indexes.
My main question is this: is the lack of an index backing the GROUP BY and ORDER BY data enough to make this query take 100 seconds?
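For what it's worth, the plan's own numbers roughly account for the runtime: SubPlan 2 and SubPlan 3 each seq-scan the ~30,003-row materialized view once per grouped username, at about 1.633 ms per scan. A back-of-the-envelope check using the figures from the plan above:

```sql
-- 2 filter subplans x 30,003 loops x ~1.633 ms per sequential scan
SELECT 2 * 30003 * 1.633 AS total_ms;  -- ~98,000 ms, matching the ~98.6 s actual time
```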

Andys1814
  • Your query has a syntax error, and it is not clear how to fix it. – jjanes Mar 02 '22 at 04:33
  • Don't post text as images of text. Also, do EXPLAIN (ANALYZE), not just EXPLAIN. – jjanes Mar 02 '22 at 04:34
  • @jjanes There is no syntax error, the query runs fine and produces the proper results. It is a performance issue. I updated OP to include EXPLAIN ANALYZE in text form. – Andys1814 Mar 02 '22 at 04:38
  • The query has more closing parentheses than opening ones. Of course it has a syntax error. But from the plan, the bad performance is driven by the missing index on username of the materialized view (not the table). – jjanes Mar 02 '22 at 04:53
  • @jjanes fixed the syntax. For some reason stackoverflow chopped it off – Andys1814 Mar 02 '22 at 05:05
  • @jjanes Adding an index for this worked and reduced the query time to 0.6 seconds, which I am happy with. Can you refer to my bolded question at the end of OP and maybe elaborate on why this lack of index makes the query time so incredibly bad? Also curious on how you discovered this from the query plan. If you type this in a solution I would be more than happy to accept your answer. – Andys1814 Mar 02 '22 at 05:10
  • The scalar sub-query to get the rank for each row is repeated over and over again. As that sub-query only returns a single row (per username), you should try to re-write that into a JOIN. Most likely this is going to be much faster. You should create an index on the username column in both tables for that. –  Mar 02 '22 at 06:44
  • Your "main question" is answered by the fact that omitting the index makes the query that slow. But then perhaps you only wanted to vent your frustration. – Laurenz Albe Mar 02 '22 at 07:21
  • @LaurenzAlbe No not at all, why do you think so? I am simply shocked by a 100 second query, and asking for clarification. – Andys1814 Mar 02 '22 at 18:32
  • It is quite normal for database queries to take a long time, and it is also quite normal to get huge performance gains by adding an index or slightly rewriting a query. I see that all the time. Does that answer your question? – Laurenz Albe Mar 02 '22 at 18:39
  • As instructed in the tag description of [postgresql-performance], your question should start by declaring Postgres version and table definitions. Just follow the link, it's all there. – Erwin Brandstetter Mar 03 '22 at 03:12
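The fix the comments converge on, indexing the materialized view's username column so each correlated-subquery probe becomes an index lookup instead of a ~30,000-row sequential scan, would look like this (a sketch; the index name is left to Postgres's default naming):

```sql
-- Turns each per-username probe of the materialized view into an index
-- lookup instead of a sequential scan of all ~30,000 rows.
CREATE INDEX ON overall_hiscore_rankings (username);
```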

1 Answer


The correlated subquery is a real drag on performance, and without need.
The lack of indexes, too.

Try instead:

SELECT username
     , sum(s.level) AS total_level
     , sum(s.experience) AS total_experience
     , o.rank
FROM   overall_hiscore_rankings o
JOIN   skill_hiscore s USING (username)
WHERE  o.rank >= 1
AND    o.rank <= 25
GROUP  BY username, o.rank -- ?
ORDER  BY o.rank;

If overall_hiscore_rankings.username is unique, you can just add o.rank to GROUP BY. I'd need table definitions to be sure ...

Ideally, you have indexes like:

CREATE INDEX ON overall_hiscore_rankings (rank, username);
CREATE INDEX ON skill_hiscore (username) INCLUDE (level, experience);

The covering index only makes sense if your table is vacuumed enough. Else make it on just (username) - the index skill_hiscore_username in your query plan seems to cover that already.

Be sure to have autovacuum running, or run VACUUM ANALYZE manually once after creating your test case to get current column statistics and update the visibility map.
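A one-off manual run, as suggested, might look like:

```sql
-- Refresh column statistics and the visibility map so index-only scans
-- on the covering index stay cheap.
VACUUM (ANALYZE) skill_hiscore;
```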

BTW, you mention "23 different rows (separated by different skills)", but WHERE rank >= 1 AND rank <= 25 seems to select 25 of them. Did you mean WHERE rank > 1 AND rank < 25?

Erwin Brandstetter
  • This gives me an error relating to o.rank must appear in the GROUP BY clause or be used in an aggregate function, but you answer my question. – Andys1814 Mar 03 '22 at 03:53
  • This is precisely why I needed the subquery to select the rank, as after the join the rank is expected to be part of the aggregate – Andys1814 Mar 03 '22 at 03:56
  • @Andys1814: You can probably just add o.rank to GROUP BY - assuming there will be a single rank per username. Or else your correlated subquery would have failed, too. – Erwin Brandstetter Mar 03 '22 at 04:08
  • Yep, this works and gives me excellent query performance even compared to my initial query with an index. After adding the index, the query with the subselect averaged ~0.6 seconds, whereas I'm down to about 0.075 now. – Andys1814 Mar 03 '22 at 05:00