db<>fiddle for all of the data and queries below
I have a table events with the following structure:
create table events (
correlation_id char(26) not null,
user_id bigint,
task_id bigint not null,
location_id bigint,
type bigint not null,
created_at timestamp(6) with time zone not null,
constraint events_correlation_id_created_at_user_id_unique
unique (correlation_id, created_at, user_id)
);
This table holds records of tasks being performed, that look like this:
| correlation_id | user_id | task_id | location_id | type | created_at |
|---|---|---|---|---|---|
| 01CN4HP4AN0000000000000001 | 4 | 58 | 30 | 0 | 2018-08-17 18:17:15.348629 |
| 01CN4HP4AN0000000000000001 | 4 | 58 | 30 | 1 | 2018-08-17 18:17:22.852299 |
| 01CN4HP4AN0000000000000001 | 4 | 58 | 30 | 99 | 2018-08-17 18:17:25.535593 |
| 01CN4J9SZ80000000000000003 | 4 | 97 | 30 | 0 | 2018-08-17 18:28:00.104093 |
| 01CN4J9SZ80000000000000003 | 4 | 97 | 30 | 99 | 2018-08-17 18:29:09.016840 |
| 01CN4JC1430000000000000004 | 4 | 99 | 30 | 0 | 2018-08-17 18:29:12.963264 |
| 01CN4JC1430000000000000004 | 4 | 99 | 30 | 99 | 2018-08-17 18:32:09.272632 |
| 01CN4KJCDY0000000000000005 | 139 | 97 | 30 | 0 | 2018-08-17 18:50:09.725668 |
| 01CN4KJCDY0000000000000005 | 139 | 97 | 30 | 3 | 2018-08-17 18:50:11.842000 |
| 01CN4KJCDY0000000000000005 | 139 | 97 | 30 | 99 | 2018-08-17 18:51:42.240895 |
| 01CNC4G1Y40000000000000008 | 139 | 99 | 30 | 0 | 2018-08-20 17:00:40.260430 |
| 01CNC4G1Y40000000000000008 | 139 | 99 | 30 | 99 | 2018-08-20 17:00:47.583501 |
Rows with type = 0 indicate the start of a task, and rows with type = 99 indicate the end of a task. (Other values mean other things that are not relevant for this question, but two example rows are included here for completeness.)
Each task_id corresponds to a row from a tasks table. The only other field in the tasks table that is relevant to this question is called inprogress_status, and it can be 1 or 2, which represent Opening task and Closing task respectively.
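For reference, here is a minimal sketch of the `tasks` table as it is used in this question (the exact DDL is an assumption; only the two columns referenced here are shown):

```sql
-- Assumed minimal shape of the tasks table; real table has more columns.
create table tasks (
    id bigint primary key,
    inprogress_status bigint not null  -- 1 = Opening task, 2 = Closing task
);
```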
I was originally asked for a query that would return the list of tasks, ordered by start date and location, with a single row that includes the start (type = 0) and end (type = 99) for each task.
Here is the query I used to do that:
SELECT e.created_at::DATE, e.location_id, e.task_id
     , CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type
     , e.correlation_id
     , json_object_agg(e.type, json_build_object('timestamp', e.created_at, 'user_id', e.user_id)) AS events
FROM events e
JOIN tasks t ON e.task_id = t.id
WHERE e.type IN (0, 99)
  AND t.inprogress_status IN (1, 2)
GROUP BY created_at::DATE, location_id, task_id, correlation_id, inprogress_status
ORDER BY 1, 2, 3;
Here is the result for that query using the data shown above:
| created_at | location_id | task_id | task_type | correlation_id | events |
|---|---|---|---|---|---|
| 2018-08-17 | 30 | 58 | OPEN | 01CN4HP4AN0000000000000001 | {"0": {"timestamp": "2018-08-17T18:17:15.348629+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:17:25.535593+00:00", "user_id": 4} } |
| 2018-08-17 | 30 | 97 | CLOSE | 01CN4J9SZ80000000000000003 | {"0": {"timestamp": "2018-08-17T18:28:00.104093+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:29:09.01684+00:00", "user_id": 4} } |
| 2018-08-17 | 30 | 99 | OPEN | 01CN4JC1430000000000000004 | { "0": {"timestamp": "2018-08-17T18:29:12.963264+00:00", "user_id": 4}, "99": {"timestamp": "2018-08-17T18:32:09.272632+00:00", "user_id": 4} } |
| 2018-08-17 | 30 | 97 | CLOSE | 01CN4KJCDY0000000000000005 | { "0": {"timestamp": "2018-08-17T18:50:09.725668+00:00", "user_id": 139}, "99": {"timestamp": "2018-08-17T18:51:42.240895+00:00", "user_id": 139} } |
| 2018-08-20 | 30 | 99 | OPEN | 01CNC4G1Y40000000000000008 | { "0": {"timestamp": "2018-08-20T17:00:40.26043+00:00", "user_id": 139}, "99" : {"timestamp": "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} } |
In the above example, task_id 58 and 99 have inprogress_status = 1 and task_id 97 has inprogress_status = 2.
Now I have been asked to modify the returned data structure so that it aggregates by the inprogress_status as well, and returns the rows as pairs of OPEN+CLOSE events.
To try to figure out how to build this, I started by trying to get this format (the final format I actually want is below):
| created_at | location_id | events |
|---|---|---|
| 2018-08-17 | 30 | {"OPEN": [{"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }], "CLOSE": [{"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }, { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }]} |
| 2018-08-20 | 30 | {"OPEN": [{"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }], "CLOSE": null} |
Here is the first query I wrote to try to make this work:
WITH grouped_events AS (
SELECT e.created_at::DATE AS created_date,
location_id,
task_id,
CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
jsonb_build_object('id', e.correlation_id) ||
jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
FROM events e
JOIN tasks t ON e.task_id = t.id
WHERE type IN (0, 99)
AND inprogress_status IN (1, 2)
GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events
GROUP BY 1, 2
ORDER BY 1, 2
The problem is that this produces invalid JSON, with multiple identical keys:
{
"OPEN": {
"0": { "user_id": 4, "timestamp": "2018-08-17T18:29:12.963264+00:00" },
"99": { "user_id": 4, "timestamp": "2018-08-17T18:32:09.272632+00:00" },
"id": "01CN4JC1430000000000000004"
},
"OPEN": {
"0": { "user_id": 4, "timestamp": "2018-08-17T18:17:15.348629+00:00" },
"99": { "user_id": 4, "timestamp": "2018-08-17T18:17:25.535593+00:00" },
"id": "01CN4HP4AN0000000000000001"
},
// ... etc.
}
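(As I understand it, the duplicate keys appear because `json` values, unlike `jsonb`, preserve every key they are given as text, so aggregating two OPEN rows into one object emits `"OPEN"` twice. A standalone snippet that shows the difference, separate from the queries above:)

```sql
-- json_object_agg keeps duplicate keys in the output text;
-- jsonb_object_agg deduplicates, keeping the last value per key.
-- Uses VALUES so it runs without any tables.
SELECT json_object_agg(k, v)  AS json_result,   -- {"OPEN" : 1, "OPEN" : 2}
       jsonb_object_agg(k, v) AS jsonb_result   -- {"OPEN": 2}
FROM (VALUES ('OPEN', 1), ('OPEN', 2)) AS t(k, v);
```

So switching to `jsonb_object_agg` in the outer query would produce valid JSON, but it would silently drop all but one OPEN per group, which is not what I want either.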
I found that this query returns the data in the format shown above:
WITH grouped_events1 AS (
SELECT e.created_at::DATE AS created_date,
location_id,
task_id,
CASE t.inprogress_status WHEN 2 THEN 'CLOSE' WHEN 1 THEN 'OPEN' END AS task_type,
jsonb_build_object('id', e.correlation_id) ||
jsonb_object_agg(type, jsonb_build_object('timestamp', e.created_at, 'user_id', user_id)) AS events
FROM events e
JOIN tasks t ON e.task_id = t.id
WHERE type IN (0, 99)
AND inprogress_status IN (1, 2)
GROUP BY e.created_at::DATE, location_id, task_id, correlation_id, t.inprogress_status
), grouped_events2 AS (
SELECT created_date, location_id, task_type, json_agg(events) AS events
FROM grouped_events1
GROUP BY 1, 2, 3
)
SELECT created_date, location_id, json_object_agg(task_type, events)
FROM grouped_events2
GROUP BY 1, 2
ORDER BY 1, 2
However, the format I actually need should just pair a single OPEN with a single CLOSE, like this (each OPEN with the CLOSE that follows it in time):
| created_at | location_id | events |
|---|---|---|
| 2018-08-17 | 30 | {"OPEN": {"correlation_id": "01CN4HP4AN0000000000000001", "0" : {"timestamp" : "2018-08-17T18:17:15.348629+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:17:25.535593+00:00", "user_id" : 4} }, "CLOSE": {"correlation_id": "01CN4J9SZ80000000000000003", "0" : {"timestamp" : "2018-08-17T18:28:00.104093+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:29:09.01684+00:00", "user_id" : 4} }} |
| 2018-08-17 | 30 | {"OPEN": {"correlation_id": "01CN4JC1430000000000000004", "0" : {"timestamp" : "2018-08-17T18:29:12.963264+00:00", "user_id" : 4}, "99" : {"timestamp" : "2018-08-17T18:32:09.272632+00:00", "user_id" : 4} }, "CLOSE": { "correlation_id": "01CN4KJCDY0000000000000005", "0" : {"timestamp" : "2018-08-17T18:50:09.725668+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-17T18:51:42.240895+00:00", "user_id" : 139} }} |
| 2018-08-20 | 30 | {"OPEN": {"correlation_id": "01CNC4G1Y40000000000000008", "0" : {"timestamp" : "2018-08-20T17:00:40.26043+00:00", "user_id" : 139}, "99" : {"timestamp" : "2018-08-20T17:00:47.583501+00:00", "user_id" : 139} }, "CLOSE": null} |
Now I'm trying to figure out whether I'm heading in the wrong direction, because I can't see how to get to my final format from what I have.
Am I approaching this wrong? How can I get the result I'm looking for?
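For what it's worth, here is the direction I have been experimenting with, building on `grouped_events1` from above (a sketch only — the rule "pair the nth OPEN of a day/location with its nth CLOSE, both ordered by the `type = 0` timestamp" is an assumption about my data, not something I have verified holds in general):

```sql
-- Number OPEN and CLOSE groups separately within each day/location,
-- ordered by the start (type = 0) timestamp stored in the events object,
-- then match the nth OPEN with the nth CLOSE on that number.
WITH numbered AS (
    SELECT created_date, location_id, task_type, events,
           row_number() OVER (PARTITION BY created_date, location_id, task_type
                              ORDER BY events -> '0' ->> 'timestamp') AS pair_no
    FROM grouped_events1
)
SELECT o.created_date, o.location_id,
       jsonb_build_object('OPEN', o.events, 'CLOSE', c.events) AS events
FROM numbered o
LEFT JOIN numbered c
       ON  c.created_date = o.created_date
       AND c.location_id  = o.location_id
       AND c.task_type    = 'CLOSE'
       AND c.pair_no      = o.pair_no
WHERE o.task_type = 'OPEN'
ORDER BY 1, 2;
```

On the sample data this produces the three rows shown in the desired table (with `"CLOSE": null` on 2018-08-20), but I don't know whether positional pairing is actually the right rule when OPEN and CLOSE tasks interleave, which is part of why I'm asking.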
Comments:

- *Erwin Brandstetter (Aug 09 '22):* Within each `correlation_id`, the start of a task (`type = 0`) always has `min(created_at)`, and the end of a task (`type = 99`) always has `max(created_at)`. Is that so?
- *Moshe Katz (Aug 09 '22):* The `correlation_id` is generated by the client when the first `events` row is inserted at the start of the task performance, and it is used for all of the events related to that performance of the task to tie them together. We cannot rely on `min(created_at)` being `type = 0` and `max(created_at)` being `type = 99`, even though that is true for all of the examples I included here. These just happened to be the samples that I had in front of me at the moment.
- *Moshe Katz (Aug 09 '22):* Two reasons `min` and `max` cannot be used to determine type: 1 - some types of tasks have other events that happen before or after the "official" start and end time of the task; 2 - if the task is still in progress there won't be a `99` yet. I can add more examples if that will help.
- *Erwin Brandstetter (Aug 09 '22):* Regarding "`min(created_at)` being `type = 0`" - can we still use `min(created_at)` to determine the flow of tasks, or do we have to use `created_at` of the row with `type = 0`? Can time ranges overlap?