2

I am working with Netezza SQL.

I have the following table ("my_table"):

   gender country favorite_color disease   height   weight id
1       m  mexico           blue       n 193.6547 62.74102  1
2       m  canada            red       n 159.4800 98.77469  2
3       m     usa          green       y 186.4446 81.48848  3
4       f     usa          green       y 180.7724 81.07389  4
5       m  mexico          green       n 163.0572 88.69809  5
6       f     usa          green       y 199.8869 71.45501  6
7       f  mexico            red       y 185.2462 97.55587  7
8       f  canada            red       y 158.2372 77.69315  8
9       m  mexico           blue       y 193.5437 91.10319  9
10      m     usa           blue       y 187.6475 66.67750 10
11      f  mexico            red       y 173.8944 84.64233 11
12      f     usa           blue       y 162.8618 70.73499 12
13      f  canada            red       y 151.8939 63.65442 13
14      m  mexico           blue       y 188.8348 62.40908 14
15      f     usa            red       y 155.6472 71.84554 15

My Question: I am trying to calculate the disease rate for different groups of patients based on gender, country, favorite_color, height percentiles ... and weight percentiles within height percentiles.

This is my attempt to write a query for this problem:

WITH ntiles AS (
    SELECT 
        height,
        weight,
        gender,
        country,
        favorite_color,
        disease, 
        NTILE(5) OVER (PARTITION BY gender, country, favorite_color ORDER BY height) as height_ntile
    FROM my_table
),
ntiles2 AS (
    SELECT 
        *,
        NTILE(5) OVER (PARTITION BY gender, country, favorite_color, height_ntile ORDER BY weight) as weight_ntile
    FROM ntiles
)
SELECT 
    height_ntile, 
    weight_ntile,
    gender,
    country,
    favorite_color,
    MIN(height) as min_height, 
    MAX(height) as max_height,
    MIN(weight) as min_weight, 
    MAX(weight) as max_weight,
    COUNT(*) as count, 
    COUNT(CASE WHEN disease = 'y' THEN 1 END) as disease_count,
    COUNT(CASE WHEN disease = 'y' THEN 1 END)*100.0/COUNT(*) as disease_rate
FROM ntiles2
GROUP BY height_ntile, weight_ntile, gender, country, favorite_color;

When I look at some of the same results from this query:

    height_ntile weight_ntile gender country favorite_color min_height max_height min_weight max_weight count disease_count disease_rate
235            5            1      m     usa            red   203.9991   219.7695   50.39345   68.26760   225           108     48.00000
236            5            2      m     usa            red   204.0440   219.8029   68.31249   87.35364   224           110     49.10714
237            5            3      m     usa            red   203.9897   219.9862   87.55315  105.54138   224           113     50.44643
238            5            4      m     usa            red   204.0706   219.9600  105.62797  122.04225   224           127     56.69643
239            5            5      m     usa            red   204.0137   219.9972  122.15583  139.92434   224           102     45.53571

When looking at these results, I see that even when all variables are held as constant except for weight_ntile - the height min/max values are slightly different each time. It seems like the height_ntile is being re-calculated each time - I am not sure why this is happening because I am "fixing" this variable as constant and calculating weight_ntiles within the a pre-calculated height_ntile. Thus, shouldn't the min/max values for a given height_ntile remain constant?

I am not sure why this is happening and if there might be a way to prevent this?

Thanks!

Note: SQL Code to create data for this problem:

CREATE TABLE MY_TABLE (
    id INT,
    gender CHAR(1),
    country VARCHAR(50),
    favorite_color VARCHAR(50),
    disease CHAR(1),
    height DECIMAL(9, 4),
    weight DECIMAL(9, 5)
);

INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (1, 'm', 'mexico', 'blue', 'n', 193.6547, 62.74102); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (2, 'm', 'canada', 'red', 'n', 159.4800, 98.77469); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (3, 'm', 'usa', 'green', 'y', 186.4446, 81.48848); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (4, 'f', 'usa', 'green', 'y', 180.7724, 81.07389); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (5, 'm', 'mexico', 'green', 'n', 163.0572, 88.69809); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (6, 'f', 'usa', 'green', 'y', 199.8869, 71.45501); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (7, 'f', 'mexico', 'red', 'y', 185.2462, 97.55587); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (8, 'f', 'canada', 'red', 'y', 158.2372, 77.69315); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (9, 'm', 'mexico', 'blue', 'y', 193.5437, 91.10319); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (10, 'm', 'usa', 'blue', 'y', 187.6475, 66.67750); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (11, 'f', 'mexico', 'red', 'y', 173.8944, 84.64233); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (12, 'f', 'usa', 'blue', 'y', 162.8618, 70.73499); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (13, 'f', 'canada', 'red', 'y', 151.8939, 63.65442); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (14, 'm', 'mexico', 'blue', 'y', 188.8348, 62.40908); INSERT INTO MY_TABLE (id, gender, country, favorite_color, disease, height, weight) VALUES (15, 'f', 'usa', 'red', 'y', 155.6472, 71.84554);

stats_noob
  • 143
  • 7
  • 1
    Addressing the "why" part, within a group of a single height_ntile, min & max height are fixed. But by adding weight_ntile as an additional grouping criterion you are getting subgroups of height_ntile, and each subgroup has its own min & max of height. At a guess, you could try throwing PARTITION BY at your MIN/MAX(height) with an appropriate list of columns to get fixed values of min/max per height_ntile (regarding the "how" part). – Andriy M Jul 09 '23 at 19:39
  • @ akina: thank you for your reply! I will add a statement to create the dataset as suggested per tip #5 – stats_noob Jul 09 '23 at 20:21
  • @ Andriy M; Thank you for your reply! Do you think you can please show me what you meant if you have time later? Btw your previous answer disappeared from my earlier question :( – stats_noob Jul 09 '23 at 20:28
  • can you try this https://gist.github.com/TheRockStarDBA/cf57f71e3be9b71ba4ac33c415f40c6f – Kin Shah Jul 09 '23 at 20:29
  • @ kin shah: thank you so much for your reply! can you please tell me how you think your logic might address the problems I am facing? thank you so much! – stats_noob Jul 09 '23 at 20:36
  • Does it answer your question ? or you still having issues ? if yes, I can post an answer later. – Kin Shah Jul 09 '23 at 21:02
  • @ Kin Shah: I think it might have answered my question? I am still going through your logic and trying to understand ... I have been working on this problem for a few days now and everything is disorganized/confused in my head... thank you so much for all your help! – stats_noob Jul 09 '23 at 21:28
  • I deleted my answer because after rethinking it, my suggestion seemed irrelevant to the actual problem. – Andriy M Jul 10 '23 at 10:24
  • On a different note, the way the notification system works on this site, you need to put the nickname of a person you are replying to immediately after the @ (without a space in-between). Otherwise they won't be notified. Just FYI. – Andriy M Jul 10 '23 at 10:28

1 Answers1

2

The root of this issue is that NTILE function will break your data up into approximate quantiles. When the number of rows in your dataset isn’t perfectly divisible by the number of buckets, some buckets may end up slightly larger or smaller. NTILE doesn't aim to balance the ranges within these generated tiles (buckets), instead, it's function is just to assure the number of records in each tile is approximately similar. moving MIN(height), MAX(height) to a separate query will solve your issue where you calculate the min/max height values only ONCE for each height_ntile, and then join it back to your main ntiles2 table.

the full query is at https://gist.github.com/TheRockStarDBA/cf57f71e3be9b71ba4ac33c415f40c6f

Kin Shah
  • 62,225
  • 6
  • 120
  • 236