MySQL - maximum of sum over different months with ties over multiple years

Question

This question was inspired by this one [closed] and is virtually identical to this one, but targets a different RDBMS (PostgreSQL vs. MySQL).

Suppose I have a list of tumours (this data is simulated from real data):

CREATE table illness (nature_of_illness VARCHAR(25), created_at DATETIME);

INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung',   '2018-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2018-02-03 17:50:32');
-- 2017, with 1 Cervix and Lung each for the month of Jan - tie!
INSERT INTO illness VALUES ('Cervix', '2017-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung',   '2017-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung',   '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-02-03 17:50:32');

I want to find out which tumour was most common in each month of each year - so far so good!

Now, you will notice that for month 1 of 2017 there is a tie. It makes no sense to pick one type at random and give that as the answer, so ties have to be included, which makes the problem much more challenging.

The correct answer is:

  Year    Month  Tumour count      Type
  2017        1             1    Cervix  -- note tie
  2017        1             1      Lung  --   "   "
  2017        2             3      Lung
  2018        1             5    Cervix
  2018        2             3      Lung

A further bonus would be to have the month name appear as text rather than an integer.

I have a solution but it's quite complex - I'd like to know if my solution is optimal or not. The MySQL fiddle is here!

I understand this is an SQL-specific question, but this can be made much simpler by using a time series database. — Sash, May 07 '18 at 17:27
@Sash, it can be done much simpler with most SQL DBMS, including newer versions of MySQL/MariaDB. MySQL 5.6 does not implement much functionality invented after SQL92. — Lennart - Slava Ukraini, May 07 '18 at 19:07

score 5 · Accepted Answer · answered May 07 '18 at 14:25

My attempt to solve this is as follows. I would appreciate any advice on how this query could be improved:

SELECT
  t3.c_year  AS "Year",
  t3.c_month AS "Month",
  t3.il_mc   AS "Tumour count",
  t4.ill_nat AS "Type"
FROM
(
  SELECT c_year, c_month, il_mc FROM
  (
    -- highest per-type count within each year and month
    SELECT
      c_year,
      c_month,
      MAX(month_count) AS il_mc
    FROM
    (
      SELECT nature_of_illness AS illness,
        EXTRACT(YEAR  FROM created_at) AS c_year,
        EXTRACT(MONTH FROM created_at) AS c_month,
        COUNT(EXTRACT(MONTH FROM created_at)) AS month_count
      FROM illness
      GROUP BY illness, c_year, c_month
      ORDER BY c_year, c_month
    ) AS t1
    GROUP BY c_year, c_month
  ) AS t2
) AS t3
JOIN
(
  -- count of each tumour type per year and month
  SELECT
    EXTRACT(YEAR  FROM created_at) AS t_year,
    EXTRACT(MONTH FROM created_at) AS t_month,
    nature_of_illness AS ill_nat,
    COUNT(nature_of_illness) AS ill_cnt
  FROM illness
  GROUP BY t_year, t_month, nature_of_illness
  ORDER BY t_year, t_month, nature_of_illness
) AS t4
  ON  t3.c_year  = t4.t_year
  AND t3.c_month = t4.t_month
  AND t3.il_mc   = t4.ill_cnt

And it does give the correct result, as can be seen in the fiddle here!
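For the month-name bonus from the question, a variant along the following lines may work on MySQL 5.6. This is only a sketch, not a tested replacement for the query above: the aliases cnt, mx, per_type, illness_type, c_month_name and max_count are illustrative names that do not appear in the original, and it also drops the extra wrapper level. It assumes MONTHNAME() is acceptable for the text form of the month.

```sql
-- Sketch for MySQL 5.6+: per-type monthly counts joined to the per-month maximum,
-- with the month shown by name. All aliases here are illustrative.
SELECT
  cnt.c_year       AS "Year",
  cnt.c_month_name AS "Month",
  cnt.month_count  AS "Tumour count",
  cnt.illness_type AS "Type"
FROM
(
  -- count of each tumour type per year and month
  SELECT
    YEAR(created_at)      AS c_year,
    MONTH(created_at)     AS c_month,
    MONTHNAME(created_at) AS c_month_name,
    nature_of_illness     AS illness_type,
    COUNT(*)              AS month_count
  FROM illness
  GROUP BY c_year, c_month, c_month_name, illness_type
) AS cnt
JOIN
(
  -- highest per-type count within each year and month
  SELECT c_year, c_month, MAX(month_count) AS max_count
  FROM
  (
    SELECT
      YEAR(created_at)  AS c_year,
      MONTH(created_at) AS c_month,
      COUNT(*)          AS month_count
    FROM illness
    GROUP BY c_year, c_month, nature_of_illness
  ) AS per_type
  GROUP BY c_year, c_month
) AS mx
  ON  cnt.c_year      = mx.c_year
  AND cnt.c_month     = mx.c_month
  AND cnt.month_count = mx.max_count
-- sort on the numeric month so that January comes before February
ORDER BY cnt.c_year, cnt.c_month, cnt.illness_type;
```

The language of the month names returned by MONTHNAME() follows the lc_time_names system variable, so the text may differ between servers.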

I don't think it's possible to do much simpler. One alternative that comes to mind is a sub-select instead of a join for getting counts that equal the maximum count for the year and month. Possible, but hardly simpler. Another option is using variables to mimic rank() over (partition by ...) and hope that you have found a new job by the time the query has to be changed ;-) — Lennart - Slava Ukraini, May 07 '18 at 19:56
Hopefully we'll be on MySQL 8 before anything like that happens :-). It finally brings MySQL into the 21st century! Analytics, CTEs, proper REGEXPs - looks good - even though you can't do INTERSECTs and a few other gripes, but it looks like Oracle have really put a lot into this release. — Vérace, May 07 '18 at 20:26
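The sub-select alternative mentioned in the comment above could look roughly like the following. This is a sketch of one possible reading of that suggestion, not the commenter's own code: the aliases t1, t2 and illness_type are illustrative, and the correlated sub-select simply replaces the join against the per-month maximum.

```sql
-- Sketch for MySQL 5.6+: correlated sub-select instead of a join.
SELECT
  t1.c_year       AS "Year",
  t1.c_month      AS "Month",
  t1.month_count  AS "Tumour count",
  t1.illness_type AS "Type"
FROM
(
  -- count of each tumour type per year and month
  SELECT
    YEAR(created_at)  AS c_year,
    MONTH(created_at) AS c_month,
    nature_of_illness AS illness_type,
    COUNT(*)          AS month_count
  FROM illness
  GROUP BY c_year, c_month, illness_type
) AS t1
WHERE t1.month_count =
(
  -- highest per-type count in the same year and month as the outer row
  SELECT MAX(t2.month_count)
  FROM
  (
    SELECT
      YEAR(created_at)  AS c_year,
      MONTH(created_at) AS c_month,
      COUNT(*)          AS month_count
    FROM illness
    GROUP BY c_year, c_month, nature_of_illness
  ) AS t2
  WHERE t2.c_year  = t1.c_year
    AND t2.c_month = t1.c_month
)
ORDER BY t1.c_year, t1.c_month, t1.illness_type;
```

As the comment says, this is hardly simpler: the per-type aggregation is still written twice, only the join condition has moved into the WHERE clause.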

danblack · Answer 2 · 2019-04-30T05:42:42.980

Using MySQL 8.0 and a CTE, we first create tmp as the aggregate count grouped by year, month and nature_of_illness. RANK() then assigns the same rank to equal values of c within each year/month partition, so a tied maximum is returned for every type that reaches it:

SELECT y AS 'Year', mon AS 'Month', c AS 'Tumor Count', nature_of_illness AS 'Type'
FROM (
  WITH tmp AS (
    SELECT YEAR(created_at) AS y, MONTH(created_at) AS mon, COUNT(*) AS c, nature_of_illness
    FROM illness
    GROUP BY y, mon, nature_of_illness
  )
  SELECT y, mon, c, nature_of_illness,
         RANK() OVER (PARTITION BY y, mon ORDER BY c DESC) AS `rank`
  FROM tmp
) AS tmp2
WHERE `rank` = 1
ORDER BY y, mon
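To see how RANK() treats the tie, it can help to look at the ranked rows before the filter is applied. The following is a sketch for MySQL 8.0 with the CTEs written at the top level; the names monthly, ranked, mon_name and rnk are illustrative and not part of the query above, and the MONTHNAME() column is added only to cover the month-name bonus from the question.

```sql
-- Sketch for MySQL 8.0+: show every type with its rank inside its year/month.
WITH monthly AS (
  -- count of each tumour type per year and month
  SELECT
    YEAR(created_at)      AS y,
    MONTH(created_at)     AS mon,
    MONTHNAME(created_at) AS mon_name,
    nature_of_illness,
    COUNT(*)              AS c
  FROM illness
  GROUP BY y, mon, mon_name, nature_of_illness
),
ranked AS (
  -- equal counts in the same partition receive the same rank
  SELECT
    y, mon, mon_name, nature_of_illness, c,
    RANK() OVER (PARTITION BY y, mon ORDER BY c DESC) AS rnk
  FROM monthly
)
SELECT
  y                 AS `Year`,
  mon_name          AS `Month`,
  c                 AS `Tumour count`,
  nature_of_illness AS `Type`,
  rnk               AS `Rank`
FROM ranked
ORDER BY y, mon, rnk, nature_of_illness;
```

With the sample data from the question, both January 2017 rows (Cervix and Lung, one case each) should come out with rank 1, while the less common type in each of the other months gets rank 2. Adding WHERE rnk = 1 to the final SELECT then keeps only the most common type or types per month.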
