PostgreSQL - maximum of sum over different months with ties over multiple years

Question

This question is the PostgreSQL version of a question about MySQL 5.6 here. Originally, it was one question for both RDBMS's but it was suggested to me that, given the different capabilities of the two systems, I should split the question - in particular I think that CTEs (WITH clause) should make the query far more elegant and readable!

Suppose I have a list of tumours (this data is simulated from real data):

CREATE table illness (nature_of_illness VARCHAR(25), created_at DATETIME);

INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung',   '2018-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2018-02-03 17:50:32');
-- 2017, with 1 Cervix and Lung each for the month of Jan - tie!
INSERT INTO illness VALUES ('Cervix', '2017-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung',   '2017-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-02-03 17:50:32');

You want to find out which particular tumour was most common in a given month - so far so good!

Now, you will notice that for month 1 of 2017, there is a tie - so it makes no sense whatsoever to randomly pick one and give that as the answer - so ties have to be included - this makes the problem much more challenging.

I have a solution but it's quite complex - I'd like to know if my solution is optimal or not. The PostgreSQL fiddle is here! The query in the fiddle is very cumbersome - I'll have a look at using CTE's.

My first answer (which works with both PostgreSQL and MySQL) is included in the fiddle but I won't post it here as I believe that it will be superceded by PostgreSQL's superior capabilities and it would be just a copy of my answer to the MySQL question!

optimal? For performance, short code, readability, elegance, portability? What is your first priority? How many rows, and how many distinct kinds of tumour? And is there a table tumour with 1 row per existing tumour? And what's your version of Postgres? — Erwin Brandstetter, May 07 '18 at 16:15
Also, you ask for a given month, but query in the fiddle returns all months. Big difference. Which is it? — Erwin Brandstetter, May 07 '18 at 16:21

score 5 · Accepted Answer · edited Jun 15 '20 at 09:05

For a given month:

SELECT tumour_count, illness
FROM (
   SELECT count(*) AS tumour_count, illness
        , rank() OVER (ORDER BY count(*) DESC) AS rnk
   FROM   illness
   WHERE  created_at  >= '2017-01-01'  -- given month: 2007-01
   AND    created_at  <  '2017-02-01'  -- optimized for index lookup
   GROUP  BY illness
   ) sub
WHERE  rnk = 1;

There should be an index on (created_at), or maybe even (created_at, illness) to allow index-only scans.

Subqueries are a bit faster than CTEs in Postgres. So only use CTEs where you need them, or when performance is not important.

As requested in the comment:

SELECT to_char(mon, 'YYYY-MM') AS month, tumour_count, illness
FROM  (
   SELECT date_trunc('month', created_at) AS mon
        , illness
        , count(*) AS tumour_count
        , rank() OVER (PARTITION BY date_trunc('month', created_at)
                       ORDER BY count(*) DESC) AS rnk
   FROM   illness
   WHERE  created_at  >= '2017-01-01'  -- period from 2007-01 to 2019-01
   AND    created_at  <  '2019-02-01'
   GROUP  BY 1, 2
   ) sub
WHERE  rnk = 1
ORDER  BY mon, illness;

Careful if you have leading or dangling partial months, the counts may be misleading.

This is functionally equivalent to what ypercube already provided. Just some simplifications to be a bit shorter / faster. And the added filter for a given period of time.

Index support becomes less important with the growing share of rows read from the table - and stops being useful at all for more than roughly 5 %. (Exceptions apply, like for index-only scans.)

You can still have 1 row per month, by aggregating tying peers. Like:

SELECT to_char(mon, 'YYYY-MM') AS month, tumour_count, string_agg(illness, ' | ')
FROM  (
   SELECT date_trunc('month', created_at) AS mon
        , illness
        , count(*) AS tumour_count
        , rank() OVER (PARTITION BY date_trunc('month', created_at)
                       ORDER BY count(*) DESC) AS rnk
   FROM   illness
   WHERE  created_at  >= '2017-01-01'  -- period from 2007-01 to 2019-01
   AND    created_at  <  '2019-02-01'
   GROUP  BY 1, 2
   ) sub
WHERE  rnk = 1
GROUP  BY mon, tumour_count
ORDER  BY mon;

db<>fiddle here

Thanks for the input - that indeed does work for the month of January 2017, but if I change the later date to 2019, it does a sum of all tumours - 8 of each - I require a count by month with ties also displayed - over the entire period - the result I want is here — Vérace, May 07 '18 at 16:41
@Vérace: The short query works for a given month (note I added the missing cond. on rnk). If you want to query multiple months at once, you need to do more - like @ypercube demonstrated. — Erwin Brandstetter, May 07 '18 at 16:45

score 4 · Answer 2 · answered May 07 '18 at 16:15

Using the window RANK() function and DATE_TRUNC() (so we don't use EXTRACT() twice):

  SELECT  
    EXTRACT(YEAR  FROM year_month) AS c_year,
    EXTRACT(MONTH FROM year_month) AS c_month,
    nature_of_illness,
    month_count
  FROM
    (
      SELECT 
        nature_of_illness,
        DATE_TRUNC('month', created_at) AS year_month,
        COUNT(*) AS month_count,
        RANK() OVER (PARTITION BY DATE_TRUNC('month', created_at) 
                     ORDER BY COUNT(*) DESC)
          AS rnk
      FROM illness
      GROUP BY 
        DATE_TRUNC('month', created_at),
        nature_of_illness
    ) AS t
  WHERE rnk = 1 
  ORDER BY 
    year_month, nature_of_illness ;

Test at dbfiddle.uk.

score 2 · Answer 3 · answered May 07 '18 at 17:25

I took a crack at this and had the same general idea that ypercube did. I think his is better but I am submitting this one in case you find it interesting, and because I had fun doing it :)

SELECT
    C_YEAR,
    C_MONTH,
    NATURE_OF_ILLNESS,
    MONTH_COUNT
FROM(
SELECT
    EXTRACT(YEAR FROM CREATED_AT) C_YEAR,
    EXTRACT(MONTH FROM CREATED_AT) C_MONTH,
    NATURE_OF_ILLNESS,
    COUNT(NATURE_OF_ILLNESS) MONTH_COUNT,
    MAX(COUNT(NATURE_OF_ILLNESS)) OVER (
        PARTITION BY EXTRACT(YEAR FROM CREATED_AT),  
        EXTRACT(MONTH FROM CREATED_AT)) MAX_MONTH_COUNT
FROM 
    ILLNESS
GROUP BY 
    EXTRACT(YEAR FROM CREATED_AT),
    EXTRACT(MONTH FROM CREATED_AT),
    NATURE_OF_ILLNESS
) AS SICKNESS

WHERE MONTH_COUNT = MAX_MONTH_COUNT
ORDER BY 
    C_YEAR ASC, 
    C_MONTH ASC

PostgreSQL - maximum of sum over different months with ties over multiple years

3 Answers3

For a given month:

For any given period of time

Linked