Yep, more greatest-n-per-group questions.
Given a table releases with the following columns:
id      | primary key           |
volume  | double precision      |
chapter | double precision      |
series  | integer (foreign key) |
include | boolean               | not null
I want to select, for each series in a set, the row with the compound maximum of volume, then chapter (i.e. the highest volume, with chapter breaking ties within that volume).
Right now, if I query per-distinct-series, I can easily accomplish this as follows:
SELECT
releases.chapter AS releases_chapter,
releases.include AS releases_include,
releases.series AS releases_series
FROM releases
WHERE releases.series = 741
AND releases.include = TRUE
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST LIMIT 1;
However, if I have a large set of series (and I do), this quickly runs into efficiency issues where I'm issuing 100+ queries to generate a single page.
I'd like to roll the whole thing into a single query, where I can simply say WHERE releases.series IN (1,2,3....), but I haven't figured out how to convince Postgres to let me do that.
The naive approach would be:
SELECT releases.volume AS releases_volume,
releases.chapter AS releases_chapter,
releases.series AS releases_series
FROM
releases
WHERE
releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499,
556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137,
1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768,
1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255,
2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606,
2634, 2636, 2695, 2696 )
AND releases.include = TRUE
GROUP BY
releases_series
ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST;
Which obviously doesn't work:
ERROR: column "releases.volume" must appear in the GROUP BY clause or be used in an aggregate function
Without the GROUP BY, it does fetch everything, and with some simple procedural filtering it would even work, but there must be a "proper" way to do this in SQL.
Following the error message and adding aggregates:
SELECT max(releases.volume) AS releases_volume,
max(releases.chapter) AS releases_chapter,
releases.series AS releases_series
FROM
releases
WHERE
releases.series IN (12, 17, 44, 79, 88, 110, 129, 133, 142, 160, 193, 231, 235, 295, 340, 484, 499,
556, 581, 664, 666, 701, 741, 780, 790, 796, 874, 930, 1066, 1091, 1135, 1137,
1172, 1331, 1374, 1418, 1435, 1447, 1471, 1505, 1521, 1540, 1616, 1702, 1768,
1825, 1828, 1847, 1881, 2007, 2020, 2051, 2085, 2158, 2183, 2190, 2235, 2255,
2264, 2275, 2325, 2333, 2334, 2337, 2341, 2343, 2348, 2370, 2372, 2376, 2606,
2634, 2636, 2695, 2696 )
AND releases.include = TRUE
GROUP BY
releases_series;
This mostly works, but the issue is that the two maxima aren't coherent, because they are computed independently of each other. If a series has two rows whose volume:chapter values are 1:5 and 4:1, I need to get back 4:1, but the independent maxima return 4:5, a combination that doesn't exist in any actual row.
Frankly, this would be so simple to implement in my application code that I have to be missing something obvious here. How can I implement a query that actually satisfies my requirements?
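For reference, here is a minimal sketch of one standard PostgreSQL technique for this, DISTINCT ON; the column names match the table above, and the abbreviated IN list is just a placeholder for the full one:

-- One row per series: the first row when sorted by volume, then chapter, descending.
SELECT DISTINCT ON (releases.series)
       releases.series  AS releases_series,
       releases.volume  AS releases_volume,
       releases.chapter AS releases_chapter
FROM releases
WHERE releases.series IN (12, 17, 44, 79)   -- full series list goes here
  AND releases.include = TRUE
ORDER BY releases.series,
         releases.volume DESC NULLS LAST,
         releases.chapter DESC NULLS LAST;

DISTINCT ON (series) keeps only the first row per series in that sort order, so volume and chapter always come from the same row rather than being aggregated independently.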
Comments:

A LATERAL join doesn't seem to make that big a difference, but it's possible that's just because my dataset is too small at the moment: pastebin. Incidentally, is that "comma preceding statement" query notation a standard? It's the most horrifically ugly code formatting I've seen (IMHO, at least). – Fake Name Sep 09 '15 at 01:33

Running EXPLAIN ANALYZE on the above, I see ~4 ms of execution time jitter (the system is fairly loaded continuously), so basically either option is down in the measurement noise in terms of execution times. The table I'm querying on has ~718K rows, and I'm selecting across a set of 75 PKs. – Fake Name Sep 09 '15 at 01:35

LATERAL should shine for few PKs with many rows each. More explanation in the added links. Leading commas are just my personal preference (but I am not alone): useful for code readability, and it's easy to add / remove items from the list. Beauty lies in the eye of the beholder ... There is always jitter in timing. Compare best of 5 or more to compare warm cache to warm cache and ignore freak values. 1.7 / 2.7 ms doesn't seem too bad for the query. Are you using the index I suggested? I added one more idea. – Erwin Brandstetter Sep 09 '15 at 02:11

… series, but I already have a bunch of indices on this table, and I'd like to avoid adding too many. The most items a single PK will have is maybe 100, with a general distribution around 10-20, so I think indexing on those wouldn't help that much. Does that make sense? There are very, very few cases where include == true. – Fake Name Sep 09 '15 at 02:18

If there are very, very few cases where include == true, then the partial index I added to my answer should pay, and it would be very small too (it only contains a few rows). – Erwin Brandstetter Sep 09 '15 at 02:33

I meant include == false. As in, maybe 100 rows in the whole DB where it's false. – Fake Name Sep 09 '15 at 02:35
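Since the comments mention a LATERAL join and a partial index without showing them, here is a rough sketch of what each could look like against this table. The exact queries and index in the answer being discussed may differ, the index name is made up, and the short series list is again a placeholder:

-- For each requested series, run the original per-series query (ORDER BY ... LIMIT 1).
SELECT r.*
FROM unnest('{12,17,44,79}'::int[]) AS s(series)        -- full series list goes here
CROSS JOIN LATERAL (
    SELECT releases.series  AS releases_series,
           releases.volume  AS releases_volume,
           releases.chapter AS releases_chapter
    FROM releases
    WHERE releases.series = s.series
      AND releases.include = TRUE
    ORDER BY releases.volume DESC NULLS LAST, releases.chapter DESC NULLS LAST
    LIMIT 1
) r;

-- A possible supporting index (a guess at the shape, not necessarily the one suggested):
CREATE INDEX releases_series_vol_chap_idx
    ON releases (series, volume DESC NULLS LAST, chapter DESC NULLS LAST)
    WHERE include;   -- partial: only covers rows with include = TRUE

Note that CROSS JOIN LATERAL drops series that have no qualifying rows; a LEFT JOIN LATERAL ... ON TRUE would keep them with NULLs instead. The partial index only pays off if the include = TRUE filter excludes a meaningful share of the table, which, per the last comments above, is not the case here.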