Aggregate query for pairs of arrays of events

Question

db<>fiddle for all of the data and queries below

I have a table events with the following structure:

create table events (
    correlation_id char(26) not null,
    user_id        bigint,
    task_id        bigint not null,
    location_id    bigint,
    type           bigint not null,
    created_at     timestamp(6) with time zone not null,
    constraint events_correlation_id_created_at_user_id_unique
        unique (correlation_id, created_at, user_id)
);

This table holds records of tasks being performed, that look like this:

correlation_id	user_id	task_id	location_id	type	created_at
01CN4HP4AN0000000000000001	4	58	30	0	2018-08-17 18:17:15.348629
01CN4HP4AN0000000000000001	4	58	30	1	2018-08-17 18:17:22.852299
01CN4HP4AN0000000000000001	4	58	30	99	2018-08-17 18:17:25.535593
01CN4J9SZ80000000000000003	4	97	30	0	2018-08-17 18:28:00.104093
01CN4J9SZ80000000000000003	4	97	30	99	2018-08-17 18:29:09.016840
01CN4JC1430000000000000004	4	99	30	0	2018-08-17 18:29:12.963264
01CN4JC1430000000000000004	4	99	30	99	2018-08-17 18:32:09.272632
01CN4KJCDY0000000000000005	139	97	30	0	2018-08-17 18:50:09.725668
01CN4KJCDY0000000000000005	139	97	30	3	2018-08-17 18:50:11.842000
01CN4KJCDY0000000000000005	139	97	30	99	2018-08-17 18:51:42.240895
01CNC4G1Y40000000000000008	139	99	30	0	2018-08-20 17:00:40.260430
01CNC4G1Y40000000000000008	139	99	30	99	2018-08-20 17:00:47.583501

Rows with type = 0 indicate the start of a task, and rows with type = 99 indicate the end of a task. (Other values mean other things that are not relevant for this question, but two example rows are included here for completeness.)

Each task_id corresponds to a row from a tasks table. The only other field in the tasks table that is relevant to this question is called inprogress_status, and it can be 1 or 2, which represent Opening task and Closing task respectively.

I was originally asked for a query that would return the list of tasks, ordered by start date and location, with a single row that includes the start (type = 0) and end (type = 99) for each task.

Here is the query I used to do that:

SELECT e.created_at::DATE, e.location_id, e.task_id
     , CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type
     , e.correlation_id
     , json_object_agg(e.type, json_build_object('timestamp', e.created_at, 'user_id', e.user_id)) AS events
FROM events e
JOIN tasks t on e.task_id = t.id
WHERE e.type IN (0, 99)
AND t.inprogress_status IN (1, 2)
group by created_at::DATE, location_id, task_id, correlation_id, inprogress_status
ORDER BY 1, 2, 3;

Here is the result for that query using the data shown above:

created_at	location_id	task_id	task_type	correlation_id	events
2018-08-17	30	58	OPEN	01CN4HP4AN0000000000000001	`{"0": {"timestamp": "2018-08-17T18:17:15.348629+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:17:25.535593+00:00", "user_id": 4} }`
2018-08-17	30	97	CLOSE	01CN4J9SZ80000000000000003	`{"0": {"timestamp": "2018-08-17T18:28:00.104093+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:29:09.01684+00:00", "user_id": 4} }`
2018-08-17	30	99	OPEN	01CN4JC1430000000000000004	`{ "0": {"timestamp": "2018-08-17T18:29:12.963264+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:32:09.272632+00:00", "user_id": 4} }`
2018-08-17	30	97	CLOSE	01CN4KJCDY0000000000000005	`{ "0": {"timestamp": "2018-08-17T18:50:09.725668+00:00", "user_id": 139}, "99": {"timestamp": "2018-08-17T18:51:42.240895+00:00", "user_id": 139} }`
2018-08-20	30	99	OPEN	01CNC4G1Y40000000000000008	`{ "0": {"timestamp": "2018-08-20T17:00:40.26043+00:00", "user_id": 139}, "99" : {"timestamp": "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }`

In the above example, task_id 58 and 99 have inprogress_status = 1 and task_id 97 has inprogress_status = 2.

Now I have been asked to modify the returned data structure so that it aggregates by the inprogress_status as well, and returns the rows as pairs of OPEN+CLOSE events.

To try to figure out how to build this, I started by trying to get this format (the final format I actually want is below):

created_at	location_id	events
2018-08-17	30	{"OPEN": [{"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }], "CLOSE": [{"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }, { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }]}
2018-08-20	30	`{"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null}`

Here is the first query I wrote to try to make this work:

WITH grouped_events AS (
    SELECT e.created_at::DATE AS created_date,
        location_id,
        task_id,
        CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
        jsonb_build_object('id', e.correlation_id) ||
                jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
    FROM events e
    JOIN tasks t on e.task_id = t.id
    WHERE type IN (0, 99)
    AND inprogress_status IN (1, 2)
    GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events
GROUP BY 1, 2
ORDER BY 1, 2

The problem is that this produces invalid JSON. with multiple identical keys:

{
    "OPEN": {
        "0": { "user_id": 4, "timestamp": "2018-08-17T18:29:12.963264+00:00" },
        "99": { "user_id": 4, "timestamp": "2018-08-17T18:32:09.272632+00:00" },
        "id": "01CN4JC1430000000000000004"
    },
    "OPEN": {
        "0": { "user_id": 4, "timestamp": "2018-08-17T18:17:15.348629+00:00" },
        "99": { "user_id": 4, "timestamp": "2018-08-17T18:17:25.535593+00:00" },
        "id": "01CN4HP4AN0000000000000001"
    },
    // ... etc.
}

I found that this query returns the data in the format show above:

WITH grouped_events1 AS (
    SELECT e.created_at::DATE AS created_date,
        location_id,
        task_id,
        CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
        jsonb_build_object('id', e.correlation_id) ||
                jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
    FROM events e
    JOIN tasks t on e.task_id = t.id
    WHERE type IN (0, 99)
    AND inprogress_status IN (1, 2)
    GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
), grouped_events2 AS (
    SELECT created_date, location_id, task_type, json_agg(events) AS events
    FROM grouped_events1
    GROUP BY 1, 2, 3
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events2
GROUP BY 1, 2
ORDER BY 1, 2

However, the format I actually need should just pair a single OPEN with a single CLOSE, like this (each OPEN with the CLOSE that follows it in time):

created_at	location_id	events
2018-08-17	30	`{"OPEN": {"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, "CLOSE": {"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }}`
2018-08-17	30	`{"OPEN": {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }, "CLOSE": { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }}`
2018-08-20	30	`{"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null}`

Now I'm trying to figure out if I'm heading the wrong direction, because I can't see how to get to my final format from what I have.

Am I approaching this wrong? How can I get the result I'm looking for?

It would seem that each task per event is held together by the same correlation_id and, within, the start of a task (type = 0) always has min(created_at), and end of task (type = 99) always max(created_at). Is that so? — Erwin Brandstetter, Aug 09 '22 at 03:54
@ErwinBrandstetter Correct about correlation_id - the ID is generated by the client when the first events row is inserted at the start of the task performance, and it is used for all of the events related to that performance of the task to tie them together. We cannot rely on min(created_at) being type = 0 and max(created_at) being type = 99, even though that is true for all of the examples I included here. These just happened to be the samples that I had in front of me at the moment. — Moshe Katz, Aug 09 '22 at 04:01
Two reasons min and max cannot be used to determine type: 1 - some types of tasks have other events that happen before or after the "official" start and end time of the task; 2 - if the task is still in progress there won't be a 99 yet. I can add more examples if that will help. — Moshe Katz, Aug 09 '22 at 04:02
Your desired result indicates that after every "OPEN" event there is a "CLOSE" event. Can we rely on that and, if not, how to deal with disruptions in that pattern? — Erwin Brandstetter, Aug 09 '22 at 04:05
@ErwinBrandstetter For now we can assume that there will always be pairs of OPEN+CLOSE, with two exceptions: 1 - the first task of the day might be a CLOSE for yesterday's OPEN, in which case no OPEN will be shown; 2 - the last task of the day might be an OPEN because the corresponding CLOSE hasn't been done yet. — Moshe Katz, Aug 09 '22 at 04:13
"We cannot rely on min(created_at) being type = 0" - Can we still use min(created_at) to determine the flow of tasks - or do we have to use created_at of the row with type = 0? Can time ranges overlap? — Erwin Brandstetter, Aug 09 '22 at 04:16

Erwin Brandstetter · Accepted Answer · 2022-08-09T05:14:23.967

This produces your desired result:

SELECT the_day, location_id
     , jsonb_object_agg(task_type, events || jsonb_build_object('correlation_id', correlation_id)) AS events
FROM  (
   SELECT e.created_at::date AS the_day, e.location_id, e.correlation_id
        , count(*) FILTER (WHERE t.inprogress_status = 1)
                   OVER (PARTITION BY e.location_id ORDER BY min(e.created_at) FILTER (WHERE e.type = 0)) AS task_nr
        , CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type     
        , jsonb_object_agg(e.type, jsonb_build_object('timestamp', e.created_at, 'user_id', e.user_id)) AS events
   FROM   events e
   JOIN   tasks t on e.task_id = t.id
   WHERE  e.type IN (0, 99)
   AND    t.inprogress_status IN (1, 2)
   GROUP  BY 1, 2, e.correlation_id, t.inprogress_status
   ) sub
GROUP  BY the_day, location_id, task_nr
ORDER  BY the_day, location_id, task_nr;

db<>fiddle here

Except that missing 'OPEN' events at the start of the day and missing 'CLOSE' events at the end are just missing.

I use jsonb instead of json to allow the jsonb || jsonb operator. You can just cast the result to json, if you actually need that.

Core feature is this sophisticated expression to form task numbers:

    , count(*) FILTER (WHERE t.inprogress_status = 1)
               OVER (PARTITION BY e.location_id ORDER BY min(e.created_at) FILTER (WHERE e.type = 0)) AS task_nr

Every 'OPEN' task starts a new group. The created_at with type = 0 defines the sequence of tasks. Technically, this works because we can nest aggregate function (even with the aggregate FILTER clause) in window functions.
Related answers:

Aggregate query for pairs of arrays of events

1 Answers1