Yep, more greatest-n-per-group questions.
Given a table releases with the following columns:
id      | primary key           |
volume  | double precision      |
chapter | double precision      |
series  | integer (foreign key) |
include | boolean               | not null
I want to select, for each series in a set, the row with the compound maximum of volume, then chapter (i.e. the highest volume, with chapter breaking ties within that volume).
Right now, if I query per-distinct-series, I can easily accomplish this as follows:
SELECT
releases.chapter AS releases_chapter,
releases.include AS releases_include,
releases.series AS releases_series
FROM releases
WHERE releases.series = 741
AND releases.include = TRUE
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST LIMIT 1;
However, if I have a large set of series (and I do), this quickly runs into efficiency issues where I'm issuing 100+ queries to generate a single page.
I'd like to roll the whole thing into a single query, where I can simply say WHERE releases.series IN (1,2,3....), but I haven't figured out how to convince Postgres to let me do that.
The naive approach would be:
SELECT releases.volume AS releases_volume,
releases.chapter AS releases_chapter,
releases.series AS releases_series
FROM
releases
WHERE
releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499,
556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137,
1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768,
1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255,
2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606,
2634, 2636, 2695, 2696 )
AND releases.include = TRUE
GROUP BY
releases_series
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST;
Which obviously doesn't work:
ERROR: column "releases.volume" must appear in the GROUP BY clause or be used in an aggregate function
Without the GROUP BY, it does fetch everything, and with some simple procedural filtering it would even work, but there must be a "proper" way to do this in SQL.
Following the error message and adding aggregates:
SELECT max(releases.volume) AS releases_volume,
max(releases.chapter) AS releases_chapter,
releases.series AS releases_series
FROM
releases
WHERE
releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499,
556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137,
1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768,
1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255,
2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606,
2634, 2636, 2695, 2696 )
AND releases.include = TRUE
GROUP BY
releases_series;
This mostly works, but the issue is that the two maxima aren't coherent, because they are computed independently of each other. If a series has two rows whose volume:chapter values are 1:5 and 4:1, I need to get back 4:1, but the independent maxima return 4:5, a combination that doesn't exist in any actual row.
Frankly, this would be so simple to implement in my application code that I have to be missing something obvious here. How can I implement a query that actually satisfies my requirements?
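For reference, here is a minimal sketch of one standard PostgreSQL technique for this, DISTINCT ON; the column names match the table above, and the abbreviated IN list is just a placeholder for the full one:

-- One row per series: the first row when sorted by volume, then chapter, descending.
SELECT DISTINCT ON (releases.series)
       releases.series  AS releases_series,
       releases.volume  AS releases_volume,
       releases.chapter AS releases_chapter
FROM releases
WHERE releases.series IN (12, 17, 44, 79)   -- full series list goes here
  AND releases.include = TRUE
ORDER BY releases.series,
         releases.volume DESC NULLS LAST,
         releases.chapter DESC NULLS LAST;

DISTINCT ON (series) keeps only the first row per series in that sort order, so volume and chapter always come from the same row rather than being aggregated independently.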
Comments:

A LATERAL join doesn't seem to make that big a difference, but it's possible that's just because my dataset is too small at the moment: pastebin. Incidentally, is that "comma preceding statement" query notation a standard? It's the most horrifically ugly code formatting I've seen (IMHO, at least). – Fake Name Sep 09 '15 at 01:33

Running EXPLAIN ANALYZE on the above, I see ~4 ms of execution time jitter (the system is fairly loaded continuously), so basically either option is down in the measurement noise in terms of execution times. The table I'm querying on has ~718K rows, and I'm selecting across a set of 75 PKs. – Fake Name Sep 09 '15 at 01:35

LATERAL should shine for few PKs with many rows each. More explanation in the added links. Leading commas are just my personal preference (but I am not alone): useful for code readability, and it's easy to add / remove items from the list. Beauty lies in the eye of the beholder ... There is always jitter in timing. Compare best of 5 or more to compare warm cache to warm cache and ignore freak values. 1.7 / 2.7 ms doesn't seem too bad for the query. Are you using the index I suggested? I added one more idea. – Erwin Brandstetter Sep 09 '15 at 02:11

… series, but I already have a bunch of indices on this table, and I'd like to avoid adding too many. The most items a single PK will have is maybe 100, with a general distribution around 10-20, so I think indexing on those wouldn't help that much. Does that make sense? There are very, very few cases where include == true. – Fake Name Sep 09 '15 at 02:18

If there are very, very few cases where include == true, then the partial index I added to my answer should pay, and it would be very small too (it only contains a few rows). – Erwin Brandstetter Sep 09 '15 at 02:33

I meant include == false. As in, maybe 100 rows in the whole DB where it's false. – Fake Name Sep 09 '15 at 02:35
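Since the comments mention a LATERAL join and a partial index without showing them, here is a rough sketch of what each could look like against this table. The exact queries and index in the answer being discussed may differ, the index name is made up, and the short series list is again a placeholder:

-- For each requested series, run the original per-series query (ORDER BY ... LIMIT 1).
SELECT r.*
FROM unnest('{12,17,44,79}'::int[]) AS s(series)        -- full series list goes here
CROSS JOIN LATERAL (
    SELECT releases.series  AS releases_series,
           releases.volume  AS releases_volume,
           releases.chapter AS releases_chapter
    FROM releases
    WHERE releases.series = s.series
      AND releases.include = TRUE
    ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST
    LIMIT 1
) r;

-- A possible supporting index (a guess at the shape, not necessarily the one suggested):
CREATE INDEX releases_series_vol_chap_idx
    ON releases (series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
    WHERE include;   -- partial: only covers rows with include = TRUE

Note that CROSS JOIN LATERAL drops series that have no qualifying rows; a LEFT JOIN LATERAL ... ON TRUE would keep them with NULLs instead. The partial index only pays off if the include = TRUE filter excludes a meaningful share of the table, which, per the last comments above, is not the case here.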