1

This question is the PostgreSQL version of a question about MySQL 5.6 here. Originally, it was one question for both RDBMS's but it was suggested to me that, given the different capabilities of the two systems, I should split the question - in particular I think that CTEs (WITH clause) should make the query far more elegant and readable!

Suppose I have a list of tumours (this data is simulated from real data):

CREATE table illness (nature_of_illness VARCHAR(25), created_at DATETIME);

INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung',   '2018-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2018-02-03 17:50:32');
-- 2017, with 1 Cervix and Lung each for the month of Jan - tie!
INSERT INTO illness VALUES ('Cervix', '2017-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung',   '2017-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-02-03 17:50:32');

You want to find out which particular tumour was most common in a given month - so far so good!

Now, you will notice that for month 1 of 2017, there is a tie - so it makes no sense whatsoever to randomly pick one and give that as the answer - so ties have to be included - this makes the problem much more challenging.

I have a solution but it's quite complex - I'd like to know if my solution is optimal or not. The PostgreSQL fiddle is here! The query in the fiddle is very cumbersome - I'll have a look at using CTE's.

My first answer (which works with both PostgreSQL and MySQL) is included in the fiddle but I won't post it here as I believe that it will be superceded by PostgreSQL's superior capabilities and it would be just a copy of my answer to the MySQL question!

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
Vérace
  • 29,825
  • 9
  • 70
  • 84
  • optimal? For performance, short code, readability, elegance, portability? What is your first priority? How many rows, and how many distinct kinds of tumour? And is there a table tumour with 1 row per existing tumour? And what's your version of Postgres? – Erwin Brandstetter May 07 '18 at 16:15
  • Also, you ask for a given month, but query in the fiddle returns all months. Big difference. Which is it? – Erwin Brandstetter May 07 '18 at 16:21

3 Answers3

5

For a given month:

SELECT tumour_count, illness
FROM (
   SELECT count(*) AS tumour_count, illness
        , rank() OVER (ORDER BY count(*) DESC) AS rnk
   FROM   illness
   WHERE  created_at  >= '2017-01-01'  -- given month: 2007-01
   AND    created_at  <  '2017-02-01'  -- optimized for index lookup
   GROUP  BY illness
   ) sub
WHERE  rnk = 1;

There should be an index on (created_at), or maybe even (created_at, illness) to allow index-only scans.

Subqueries are a bit faster than CTEs in Postgres. So only use CTEs where you need them, or when performance is not important.

Related:

For any given period of time

As requested in the comment:

SELECT to_char(mon, 'YYYY-MM') AS month, tumour_count, illness
FROM  (
   SELECT date_trunc('month', created_at) AS mon
        , illness
        , count(*) AS tumour_count
        , rank() OVER (PARTITION BY date_trunc('month', created_at)
                       ORDER BY count(*) DESC) AS rnk
   FROM   illness
   WHERE  created_at  >= '2017-01-01'  -- period from 2007-01 to 2019-01
   AND    created_at  <  '2019-02-01'
   GROUP  BY 1, 2
   ) sub
WHERE  rnk = 1
ORDER  BY mon, illness;

Careful if you have leading or dangling partial months, the counts may be misleading.

This is functionally equivalent to what ypercube already provided. Just some simplifications to be a bit shorter / faster. And the added filter for a given period of time.

Index support becomes less important with the growing share of rows read from the table - and stops being useful at all for more than roughly 5 %. (Exceptions apply, like for index-only scans.)

You can still have 1 row per month, by aggregating tying peers. Like:

SELECT to_char(mon, 'YYYY-MM') AS month, tumour_count, string_agg(illness, ' | ')
FROM  (
   SELECT date_trunc('month', created_at) AS mon
        , illness
        , count(*) AS tumour_count
        , rank() OVER (PARTITION BY date_trunc('month', created_at)
                       ORDER BY count(*) DESC) AS rnk
   FROM   illness
   WHERE  created_at  >= '2017-01-01'  -- period from 2007-01 to 2019-01
   AND    created_at  <  '2019-02-01'
   GROUP  BY 1, 2
   ) sub
WHERE  rnk = 1
GROUP  BY mon, tumour_count
ORDER  BY mon;

db<>fiddle here

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
  • Thanks for the input - that indeed does work for the month of January 2017, but if I change the later date to 2019, it does a sum of all tumours - 8 of each - I require a count by month with ties also displayed - over the entire period - the result I want is here – Vérace May 07 '18 at 16:41
  • @Vérace: The short query works for a given month (note I added the missing cond. on rnk). If you want to query multiple months at once, you need to do more - like @ypercube demonstrated. – Erwin Brandstetter May 07 '18 at 16:45
4

Using the window RANK() function and DATE_TRUNC() (so we don't use EXTRACT() twice):

  SELECT  
    EXTRACT(YEAR  FROM year_month) AS c_year,
    EXTRACT(MONTH FROM year_month) AS c_month,
    nature_of_illness,
    month_count
  FROM
    (
      SELECT 
        nature_of_illness,
        DATE_TRUNC('month', created_at) AS year_month,
        COUNT(*) AS month_count,
        RANK() OVER (PARTITION BY DATE_TRUNC('month', created_at) 
                     ORDER BY COUNT(*) DESC)
          AS rnk
      FROM illness
      GROUP BY 
        DATE_TRUNC('month', created_at),
        nature_of_illness
    ) AS t
  WHERE rnk = 1 
  ORDER BY 
    year_month, nature_of_illness ;

Test at dbfiddle.uk.

ypercubeᵀᴹ
  • 97,895
  • 13
  • 214
  • 305
2

I took a crack at this and had the same general idea that ypercube did. I think his is better but I am submitting this one in case you find it interesting, and because I had fun doing it :)

SELECT
    C_YEAR,
    C_MONTH,
    NATURE_OF_ILLNESS,
    MONTH_COUNT
FROM(
SELECT
    EXTRACT(YEAR FROM CREATED_AT) C_YEAR,
    EXTRACT(MONTH FROM CREATED_AT) C_MONTH,
    NATURE_OF_ILLNESS,
    COUNT(NATURE_OF_ILLNESS) MONTH_COUNT,
    MAX(COUNT(NATURE_OF_ILLNESS)) OVER (
        PARTITION BY EXTRACT(YEAR FROM CREATED_AT),  
        EXTRACT(MONTH FROM CREATED_AT)) MAX_MONTH_COUNT
FROM 
    ILLNESS
GROUP BY 
    EXTRACT(YEAR FROM CREATED_AT),
    EXTRACT(MONTH FROM CREATED_AT),
    NATURE_OF_ILLNESS
) AS SICKNESS

WHERE MONTH_COUNT = MAX_MONTH_COUNT
ORDER BY 
    C_YEAR ASC, 
    C_MONTH ASC
Joe
  • 21
  • 2