
This is a really fun question (asked for SQL Server) and I wanted to try it out to see how it was done in PostgreSQL. Let's see if anyone else can do it better. Taking this data,

CREATE TABLE foo
AS
  SELECT pkid::int, numvalue::int, groupid::int
  FROM ( VALUES
    ( 1,  -1   , 1 ),
    ( 2,  -2   , 1 ),
    ( 3,  5    , 1 ),
    ( 4,  -7   , 1 ),
    ( 5,  1    , 2 )
  ) AS t(pkid, numvalue, groupid);

We're trying to generate this:

PKID   RollingSum    GroupID
----------------------------- ## Explanation:
1      0             1        ## 0 - 1 < 0  => 0
2      0             1        ## 0 - 2 < 0  => 0
3      5             1        ## 0 + 5 > 0  => 5
4      0             1        ## 5 - 7 < 0  => 0
5      1             2        ## 0 + 1 > 0  => 1

The problem is described as,

When adding a negative number would cause the sum to become negative, a limit is activated that sets the result to zero. Subsequent additions should be based on this adjusted value, not on the original rolling sum.

The expected result should be achieved using addition. If the fourth number changes from -7 to -3, the fourth result should be 2 instead of 0.

If a single final sum can be provided rather than the full series of rolling values, that would also be acceptable. I could use a stored procedure to implement a non-negative addition, but that seems too low-level.

The real-life problem behind this is that we record orders placed as positive amounts and cancellations as negative ones. Due to connectivity issues, customers may click the cancel button more than once, which results in multiple negative values being recorded. When calculating our revenue, zero needs to act as a floor for sales.

The solutions there all use recursion.

Evan Carroll

4 Answers


This is how I solved a similar problem on Teradata using nested OLAP-functions:

SELECT dt.*,
   -- find the lowest previous CumSum < 0       
   -- and adjust the current CumSum to zero
   Max(CASE WHEN CumSum < 0 THEN -CumSum ELSE 0 end)
       Over (PARTITION BY groupid
             ORDER BY pkid
             ROWS Unbounded Preceding)
   + CumSum AS AdjustedSum
FROM 
 ( 
   SELECT pkid, numvalue, groupid,
      -- calculate a standard cumulative sum
      Sum(numvalue)
      Over (PARTITION BY groupid
            ORDER BY pkid
            ROWS Unbounded Preceding) AS CumSum
   FROM foo
 ) AS dt
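It helps to trace the two layers by hand. A sketch (not part of the original answer) that inspects the inner derived table on its own for group 1 of the sample data, with the output worked out manually:

```sql
-- Inner layer: plain cumulative sum per group
SELECT pkid, numvalue,
       Sum(numvalue) Over (PARTITION BY groupid
                           ORDER BY pkid
                           ROWS Unbounded Preceding) AS CumSum
FROM   foo
WHERE  groupid = 1;
--  pkid | numvalue | CumSum
--     1 |       -1 |     -1
--     2 |       -2 |     -3
--     3 |        5 |      2
--     4 |       -7 |     -5
-- The outer MAX(CASE ...) then yields the running correction 1, 3, 3, 5;
-- adding CumSum back gives the adjusted sums 0, 0, 5, 0.
```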
dnoeth
  • Clever! (but probably not efficient) – joanolo Jul 09 '17 at 20:12
  • 1
    I'm just now looking at this, and I find myself having to experiment with it a bit, you should have explained a little. Cool method anyway. =) – Evan Carroll Jul 09 '17 at 20:16
  • 2
    @joanolo: Probably more efficient than the recursive solutions for SQL Server (at least way more efficient on Teradata). A custom Windowed Aggregate function would be more efficient on Teradata, too, but must be coded in C :-) – dnoeth Jul 09 '17 at 20:54
  • 1
    This is about 5X faster on my machine than recursion in SQL Server. – Joe Obbish Jul 10 '17 at 03:31
  • With PKID being unique, is there a need to do ROWS Unbounded Preceding? – Evan Carroll Jul 10 '17 at 14:30
  • @EvanCarroll: No, but the default is RANGE UNBOUNDED PRECEDING and this is way harder to calculate than ROWS (don't know if the optimizer is smart enough to change that for a unique order) – dnoeth Jul 10 '17 at 14:36
  • This is probably the second most difficult query to understand thus far. It seems to share a special property in complexity with https://dba.stackexchange.com/q/167068/2639 which also took me a while to grok. – Evan Carroll Jul 10 '17 at 23:06
  • @EvanCarroll: There's probably (simple?) mathematical explanation for this, but I'm not a math guy :-) I just spotted that years ago while drawing a chart to visualize a similar problem. – dnoeth Jul 11 '17 at 06:30
  • @EvanCarroll: Regarding the gaps & islands problem: I'm a trainer for Teradata and a simplified version is one of my "optional" labs. In the last 15 years very few found that solution (usually those who had never worked with OLAP-functions before); most (if at all) compared to the previous row, created a 0/1 flag based on that, and then did a cumulative sum over the flag to create groups. As there's no clear winner on performance, because both solutions need two OLAP calculations, I usually prefer the flag version... – dnoeth Jul 11 '17 at 06:30
  • Very smart indeed. I added a variant that should be a bit faster on Teradata as well. And an explanation. The logic is beautifully simple, but I find it hard to put it in decently short English sentences. – Erwin Brandstetter Jul 15 '17 at 02:10
  • @ErwinBrandstetter: Thanks for explaining the logic. – dnoeth Jul 15 '17 at 10:44
  • In my case I used a SUM instead of MAX, as I needed the cumulative sum to ignore negative numbers. The SUM adds back the subtracted values from the first cumulative sum – localhost Dec 10 '21 at 01:12

Using a Custom Aggregate function

We use CREATE FUNCTION to create a function int_add_pos_or_zero that adds two numbers, but if the sum would be less than 0, returns 0.

CREATE FUNCTION int_add_pos_or_zero(int, int)
RETURNS int
AS $$
  BEGIN
    RETURN greatest($1 + $2, 0);
  END;
$$
LANGUAGE plpgsql
IMMUTABLE;
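As a side note, the transition function contains no procedural logic, so it can just as well be declared in LANGUAGE sql (a minor variant, not part of the original answer; create one or the other, not both):

```sql
-- Alternative to the plpgsql version above: a plain SQL function,
-- which the planner can often inline.
CREATE FUNCTION int_add_pos_or_zero(int, int)
RETURNS int
AS $$ SELECT greatest($1 + $2, 0) $$
LANGUAGE sql
IMMUTABLE;
```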

Now we CREATE AGGREGATE on that so we can run it in a window function. We set INITCOND to be =0.

CREATE AGGREGATE add_pos_or_zero(int) (
  SFUNC = int_add_pos_or_zero,
  STYPE = int,
  INITCOND = 0
);

Now we query it like any other Window Function.

SELECT pkid,
  groupid,
  numvalue,
  add_pos_or_zero(numvalue) OVER (PARTITION BY groupid ORDER BY pkid)
FROM foo;
 pkid | groupid | numvalue | add_pos_or_zero 
------+---------+----------+-----------------
    1 |       1 |       -1 |               0
    2 |       1 |       -2 |               0
    3 |       1 |        5 |               5
    4 |       1 |       -7 |               0
    5 |       2 |        1 |               1
(5 rows)
Evan Carroll

Query with window functions

This is much like dnoeth's smart query (same basic logic). Just slightly shorter and more efficient with a simpler expression in the outer query:

SELECT groupid, pkid
     , simple_sum
     - LEAST(MIN(simple_sum)
       OVER (PARTITION BY groupid
             ORDER BY pkid ROWS UNBOUNDED PRECEDING), 0) AS rolling_sum
FROM ( 
   SELECT pkid, numvalue, groupid
        , SUM(numvalue) OVER (PARTITION BY groupid
                              ORDER BY pkid ROWS UNBOUNDED PRECEDING) AS simple_sum
   FROM   foo
   ) sub;

How does it work?
To compute the special rolling sum as requested: for every row where the simple rolling sum would turn out negative, we add the positive number that makes it zero instead. That's precisely what the calculation in the outer SELECT does; subtracting a negative number adds the corresponding positive one:

LEAST(MIN(simple_sum) OVER (PARTITION BY groupid
                            ORDER BY pkid ROWS UNBOUNDED PRECEDING), 0)

The surrounding LEAST cancels any action for positive numbers (or 0). The least negative value (greatest absolute value) of the simple running sum is the total we have to add so far. Every time our calculation would go below zero, we get a new absolute low in the simple running sum. It's all beautifully simple.
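To make the mechanics concrete, here is a standalone sketch (not part of the original answer) that feeds the precomputed simple sums of group 1 through the same outer expression, with the output worked out by hand:

```sql
-- simple_sum values for group 1 are -1, -3, 2, -5 (plain cumulative sum of foo)
SELECT pkid, simple_sum
     , Min(simple_sum) OVER (ORDER BY pkid ROWS UNBOUNDED PRECEDING) AS running_min
     , simple_sum
     - LEAST(Min(simple_sum) OVER (ORDER BY pkid ROWS UNBOUNDED PRECEDING), 0) AS rolling_sum
FROM  (VALUES (1, -1), (2, -3), (3, 2), (4, -5)) AS t(pkid, simple_sum);
--  pkid | simple_sum | running_min | rolling_sum
--     1 |         -1 |          -1 |           0
--     2 |         -3 |          -3 |           0
--     3 |          2 |          -3 |           5
--     4 |         -5 |          -5 |           0
```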

PL/pgSQL functions

Based on Abelisto's implementation, improved:

CREATE OR REPLACE FUNCTION f_special_rolling_sum()
  RETURNS TABLE (groupid int, pkid int, numvalue int, rolling_sum int) AS
$func$
DECLARE
   last_groupid int;
BEGIN
   FOR groupid, pkid, numvalue IN 
      SELECT f.groupid, f.pkid, f.numvalue
      FROM   foo f
      ORDER  BY f.groupid, f.pkid
   LOOP
      IF last_groupid = groupid THEN  -- same partition continues
         rolling_sum := GREATEST(rolling_sum + numvalue, 0);
      ELSE                            -- new partition
         last_groupid := groupid;
         rolling_sum  := GREATEST(numvalue, 0);
      END IF;
      RETURN NEXT;
   END LOOP;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_special_rolling_sum();
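For the sample data, the call returns the same result as the window-function variants (output worked out by hand from the function's logic):

```sql
SELECT * FROM f_special_rolling_sum();
--  groupid | pkid | numvalue | rolling_sum
--        1 |    1 |       -1 |           0
--        1 |    2 |       -2 |           0
--        1 |    3 |        5 |           5
--        1 |    4 |       -7 |           0
--        2 |    5 |        1 |           1
```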

Index

All solutions supplied so far can benefit from an index-only scan with a covering index:

CREATE INDEX idx_foo_covering ON foo(groupid, pkid, numvalue);
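Note that index-only scans also depend on the table's visibility map being current; after heavy writes, a VACUUM makes them more likely (generic PostgreSQL behavior, not specific to this answer):

```sql
VACUUM ANALYZE foo;  -- update visibility map and planner statistics

EXPLAIN (COSTS OFF)  -- check whether the planner picks an index-only scan
SELECT pkid, numvalue, groupid
FROM   foo
ORDER  BY groupid, pkid;
```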


Tests

After optimizing the function, the queries, the index (and the test itself), I see similar performance for both, the query being slightly faster than the function. (The aggregate function is a bit slower than the rest.) Extensive test suite (based on Abelisto's fiddle):

dbfiddle for pg 9.6 here

dbfiddle for pg 10 here

Erwin Brandstetter

Well, this is ugly, but it works without adding any new function or aggregate:

SELECT *, 
  CASE
    WHEN numvalue > 0 
    THEN sum( greatest(numvalue,0) ) OVER (PARTITION BY groupid ORDER BY pkid)
    ELSE 0 
  END AS result
FROM foo;
Evan Carroll