
Is there a Hypothetical-Set Aggregate Function equivalent to ntile() (or some other good solution) in Postgres?

I have this query:

select
    frctl
    ,*
from
    (select *
     from d_al
     where rtn is not null
       and fund_val is not null
    ) dx
    ,lateral (
        select
            round((percent_rank(dx.fund_val) WITHIN GROUP (ORDER BY fund_val))::numeric, 6) AS frctl
        from d_al
        where gen_qtr_end_dt <= dx.gen_qtr_end_dt
          and rtn is not null
          and fund_val is not null
    ) x
order by gen_qtr_end_dt, frctl

The query produces running historical percentile ranks: each value is ranked relative to all values in its own period plus all earlier periods, chronologically.

It works perfectly, except I want fractiles (i.e. the option to create deciles, quartiles, etc.) the way ntile(n) does naturally. Do I have to build a CASE expression for each set of fractiles I want? For example, for ntile(4) (quartiles), a CASE expression over the breakpoints 0, 0.25, 0.5, 0.75, 1; for ntile(10) (deciles), one over 0, 0.1, 0.2, ..., 0.9, 1; and so on. Or is there an ntile-style Hypothetical-Set Aggregate Function I am missing?
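For concreteness, a minimal sketch of the quartile CASE expression described above, assuming pctl holds the percent_rank result from the query:

    -- hypothetical quartile bucketing of a percentile rank in [0, 1)
    case when pctl < 0.25 then 1
         when pctl < 0.50 then 2
         when pctl < 0.75 then 3
         else 4
    end as frctl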

Helpful links: https://www.postgresql.org/docs/current/static/functions-aggregate.html#FUNCTIONS-HYPOTHETICAL-TABLE

Percentile rank that takes sorted argument (or same functionality) in PostgreSQL 9.3.5 (the problem in that linked question is a bit different, but closely related).

The data:

  • Big - efficiency is important, but not the focus of my question.
  • Table d_al has three columns; two matter here:
      • gen_qtr_end_dt - not unique, not null, type date
      • fund_val - can be null, type numeric
      • rtn - can be null, type numeric, not important here

I have Postgres 9.6.

PS - this query ranks against all of history, but my next step is a rolling look-back window of some number of days rather than the full history (sketched below).
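A minimal sketch of that rolling variant, assuming a hypothetical 365-day window (substitute the real look-back period); only the WHERE clause of the lateral subquery changes:

    select
        round((percent_rank(dx.fund_val) WITHIN GROUP (ORDER BY fund_val))::numeric, 6) AS frctl
    from d_al
    where gen_qtr_end_dt <= dx.gen_qtr_end_dt
        -- hypothetical window length, not from the original query
        and gen_qtr_end_dt > dx.gen_qtr_end_dt - interval '365 days'
        and rtn is not null
        and fund_val is not null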

edit 1: Here is how I am solving it now (with a CASE expression, as mentioned):

I put the above query in a CTE, then...

with pl as (
    select
        x.pctl
        ,dx.fund_val
        ,dx.rtn
        ,dx.gen_qtr_end_dt
    from
        (select *
         from d_al
         where rtn is not null
           and fund_val is not null
        ) dx
        ,lateral (
            select
                round((percent_rank(dx.fund_val) WITHIN GROUP (ORDER BY fund_val))::numeric, 6) AS pctl
            from d_al
            where gen_qtr_end_dt <= dx.gen_qtr_end_dt
              and rtn is not null
              and fund_val is not null
        ) x
)
select
    gen_qtr_end_dt
    ,case   when pl.pctl < 0.1 then 1
            when pl.pctl < 0.2 then 2
            when pl.pctl < 0.3 then 3
            when pl.pctl < 0.4 then 4
            when pl.pctl < 0.5 then 5
            when pl.pctl < 0.6 then 6
            when pl.pctl < 0.7 then 7
            when pl.pctl < 0.8 then 8
            when pl.pctl < 0.9 then 9
            else 10
     end as frctl
    ,rtn
    ,fund_val
    ,pctl
from
    pl
order by
    gen_qtr_end_dt, frctl

...which is a bit cumbersome and rigid, but doable if need be.
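A less rigid alternative to the hand-written CASE, sketched here on the assumption that the built-in width_bucket() function fits the data: it maps pctl onto n equal-width buckets without enumerating the breakpoints.

    select
        gen_qtr_end_dt
        -- width_bucket(operand, low, high, count) returns 1..count for
        -- operand in [low, high); pctl stays below 1 in this setup (see
        -- the comment under the answer below), so this yields deciles 1..10
        ,width_bucket(pl.pctl, 0, 1, 10) as frctl
        ,rtn
        ,fund_val
        ,pctl
    from pl
    order by gen_qtr_end_dt, frctl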

edit 2: And here is a sample of the output from edit 1 above:

frctl  fund_val   pctl      gen_qtr_end_dt
1      -14.514    0         3/31/2001
2      -8.618     0.142857  3/31/2001
3      1.707      0.285714  3/31/2001
5      26.162     0.428571  3/31/2001
6      141.873    0.571429  3/31/2001
8      216        0.714286  3/31/2001
9      254        0.857143  3/31/2001
1      -15.237    0.071429  6/30/2001
1      -32        0         6/30/2001
3      -6.949     0.285714  6/30/2001
5      6.307      0.428571  6/30/2001
6      28.542     0.571429  6/30/2001
7      140.816    0.642857  6/30/2001
9      239        0.857143  6/30/2001
1      -47        0.043478  9/30/2001
1      -63.367    0         9/30/2001
2      -16.599    0.130435  9/30/2001
4      -6.087     0.347826  9/30/2001
6      31.425     0.565217  9/30/2001
7      47.137     0.608696  9/30/2001
8      150.678    0.73913   9/30/2001
8      200        0.782609  9/30/2001
10     1902.684   0.956522  9/30/2001
1      -246.545   0         12/31/2001
2      -18.731    0.125     12/31/2001
4      -0.043     0.375     12/31/2001
4      -6         0.34375   12/31/2001
5      9.285      0.46875   12/31/2001
6      43.519     0.59375   12/31/2001
7      111        0.65625   12/31/2001
8      154.573    0.78125   12/31/2001
10     1017.514   0.9375    12/31/2001
1      -23.678    0.095238  3/31/2002
4      2.229      0.357143  3/31/2002
5      14         0.428571  3/31/2002
5      17.689     0.452381  3/31/2002
6      67.245     0.595238  3/31/2002
7      130.604    0.642857  3/31/2002
8      156        0.761905  3/31/2002
8      179.399    0.785714  3/31/2002
9      213.756    0.833333  3/31/2002
10     855.2      0.928571  3/31/2002
1      -26.536    0.076923  6/30/2002
3      1.295      0.288462  6/30/2002
4      9          0.365385  6/30/2002
5      16.714     0.423077  6/30/2002
6      64.547     0.557692  6/30/2002
6      103.539    0.596154  6/30/2002
8      181.284    0.769231  6/30/2002
9      203        0.807692  6/30/2002
10     600.194    0.923077  6/30/2002
10     284.306    0.903846  6/30/2002
1      -85        0.016129  9/30/2002
1      -25.475    0.096774  9/30/2002
2      -20.394    0.129032  9/30/2002
4      2.551      0.33871   9/30/2002
6      102.395    0.564516  9/30/2002
7      113.453    0.612903  9/30/2002
8      168.205    0.725806  9/30/2002
9      248        0.854839  9/30/2002
10     800.551    0.935484  9/30/2002
10     460.067    0.903226  9/30/2002

edit 3: As it stands, this approach is so slow it is unusable. The slow part is the lateral subquery with percent_rank() in it.

mountainclimber11

  • So what's wrong with the ntile() function? –  Jun 29 '17 at 20:10
  • I don't see how it could be implemented in this context, but I hope I am wrong. I clearly don't know a ton about this, but I don't know how to feed ntile() two different sets of data (dx.fund_val and fund_val) in the same call the way percent_rank() can. And I assume that is necessary... – mountainclimber11 Jun 29 '17 at 20:27
  • @mountainclimber your application requirements are not obvious from the question. It would really help if you provided a short sample input and sample output for it – filiprem Jun 29 '17 at 21:52
  • @filiprem doesn't edit 2 have all of that? frctl is my desired output and fund_val and gen_qtr_end_dt are my applicable inputs. Is there something else you need? – mountainclimber11 Jun 29 '17 at 22:01
  • @mountainclimber what we want is basic schema, basic data, and a desired output. We prefer it to be separate from the complexities of your real world example. But, at the very least, we need sample data. – Evan Carroll Jun 29 '17 at 22:04
  • @EvanCarroll The "table" d_al isn't actually a table, it is a CTE, so no schema to be had there unless I open up a huge can of worms that would distract from the focus of my question. Doesn't my edit 2 have basic input and output data? See my comment to filiprem. Or are you saying you want them separated? – mountainclimber11 Jun 30 '17 at 13:07

1 Answer


Try trunc(10 * pl.pctl) + 1, but since percent_rank() returns values in the range 0 <= n <= 1, the maximum bucket will be 11 instead of 10.
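For illustration, a sketch of that suggestion applied to the pl CTE from edit 1, with least() capping the edge case:

    select
        gen_qtr_end_dt
        -- trunc(10 * pctl) + 1 maps [0, 1) onto buckets 1..10;
        -- least(..., 10) folds the pctl = 1 edge case into bucket 10
        ,least(trunc(10 * pl.pctl)::int + 1, 10) as frctl
        ,rtn
        ,fund_val
        ,pctl
    from pl
    order by gen_qtr_end_dt, frctl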

CUME_DIST is quite similar to PERCENT_RANK but returns 0 < n <= 1, so you might switch to 1 - cume_dist... (ORDER BY fund_val DESC) for the pctl calculation.
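Spelled out, one reading of that suggestion (an illustrative sketch, not verbatim from the answer) would replace the pctl expression in the lateral subquery with:

    -- 1 - cume_dist over a descending sort yields 0 <= n < 1,
    -- so trunc(10 * pctl) + 1 stays within 1..10
    round((1 - cume_dist(dx.fund_val) WITHIN GROUP (ORDER BY fund_val DESC))::numeric, 6) AS pctl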

dnoeth
  • I think in my case, since I am always comparing the value to itself as well as to all values in its current and previous periods (gen_qtr_end_dt <= dx.gen_qtr_end_dt, note the =), I won't get a 1 out of percent_rank(); therefore the 11 issue isn't really an issue for me. At least that is what I am seeing in my results (no 1s). – mountainclimber11 Jun 30 '17 at 14:18
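A quick check of that reasoning with a hypothetical three-row set: because the hypothetical value ties with its own row in the group, percent_rank() stays below 1.

    -- the hypothetical 5 ties with the existing 5, so its rank is 3 among
    -- 4 rows and percent_rank = (rank - 1) / (total rows - 1) = 2/3
    select percent_rank(5) within group (order by v) as pr
    from (values (1), (3), (5)) t(v);
    -- pr = 0.666..., not 1; only a value strictly greater than every row
    -- already in the set would produce 1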