Find the longest unbroken and identical chain of a value in field

Question

I'm working with large datasets regarding the concentration of pollutants in rivers. When the field "time" = 1 then the concentration has exceeded a certain limit for 10 mins.

I need to know the longest (and average if possible) continuous period of exceedance.

The field "example" below shows a manual result of what I'm trying to get with the field calculator.

A bit more detail would help. To help me understand, I assume another column on the left would be indication time interval? Is some such id column or the like tucked away in the data? I'd say that would make the analysis easier. And is the order of rows/records in the table actually the same as the data? There is a function in the calculator 'row_number' but how that works will depend on the native arrangement of the table. — BJW, Mar 16 '24 at 20:46
I think this answer is relevant - https://gis.stackexchange.com/a/162015/135109 I would copy your data out of the attribute table (or export layer as xlsx without geometry) into a spreadsheet and do it there. — BJW, Mar 16 '24 at 21:01

score 2 · Answer 1 · answered Mar 18 '24 at 13:07

You can try the following convoluted expressions in Field Calculator, but please note: it's likely to be very slow with large datasets (ref this post on GIS.SE) as it will do a full table scan.

Field Calculator isn't well suited to this kind of windowed aggregation, either a Python (PyQGIS or geopandas) script or SQL via Virtual Layer would be more efficient.

Determine data order

First off, as mentioned in the comments to your question, we need to know how your data is ordered so that we can determine consecutive records to cumulatively sum.

Scenario 1 - Explicit ID column

For the expressions below the ID column must have unique values and be sequential (no gaps).
If you have the data sorted by a column(s) that does not follow the above, you could create the ID column using other methods first (examples here).

Scenario 2 - Default data order

Alternatively, if the default row order in your file/database is as you need it to be (e.g. in your screenshot), we could directly use the innate feature ID value stored as $id
- However, be absolutely sure this is the original sort order.
- Safest thing to do is create a new/virtual field with $id and use that instead per scenario 1.
- Otherwise, right click column headings in Attribute Table > Sort > uncheck 'Defined sort order in attribute table', then check to see that the data is ordered as required.

Apply expressions

You can now use the expressions below. Replace the second line with the string representation of the name of the ID column representing data order as follows:

Explicit ID column: Column name (with double quotation marks if required - e.g. space in column name), wrapped in single quotation marks outside.
- e.g. 'newid', '"new id"'
Implicit $id variable: Wrapped in single quotation marks only. Do not use double quotation marks.
- e.g. '$id'

Generate rank in group

This will generate a sequential number per "group" as in your screenshot.

with_variable('idcol',
              '"newid"',     -- Replace NAME of ID column here, inside single quotes. Inner double quotes as required. e.g. 'newid', '"new id"', '$id'
-- Generate array of ID values where "time" is recorded, sorted by same ID column. 
-- e.g. 2, 3, 4, 7, 8, 9, 10, 13, 16, 17
with_variable('idarr',
              array_agg(eval(@idcol),
                        filter:=coalesce("time",0) > 0,
                        order_by:=eval(@idcol)),
-- Get ID values from previous array which are the first in the "group" 
-- i.e. first after each gap in sequence (= "time" was present in that record but not the one preceding it). 
-- e.g. 2, 7, 13, 16
with_variable('grpheadarr',
              array_filter(@idarr,
                           array_contains(@idarr,@element-1) is false),
-- "Group" all idarr values into their own sub-arrays based on group head defined in previous step. 
-- e.g. [2, 3, 4], [7, 8, 9, 10], [13], [16, 17]
with_variable('grparr',

               array_foreach(@grpheadarr,
                             array_slice(@idarr,
                                         array_find(@idarr,
                                                    @element),
                                         coalesce(array_find(@idarr,
                                                             array_get(@grpheadarr,@counter+1)
                                                             ) - 1,
                                                  -1)
                                         )
                             ),
-- Look up current feature's ID column value and get matching sub-array, then position in sub-array, then +1 as array position starts from 0.
-- e.g. current ID = 3, matching sub-array = [7,8,9,10], rank position in sub-array = 3 (2 + 1)
-- e.g. current ID = 16, matching sub-array = [16,17], rank position in sub-array = 1 (0 + 1)
array_find(
           array_filter(@grparr,
                        array_contains(@element,eval(@idcol))
                        )[0],
           eval(@idcol)
           ) + 1
))))

Generate group sum

This will simply calculate the number of sequential records in each "group" so you can directly derive the longest period. Same as above except last section.

with_variable('idcol',
              '"newid"',     -- Replace NAME of ID column here, inside single quotes. Inner double quotes as required. e.g. 'newid', '"new id"', '$id'
with_variable('idarr',
              array_agg(eval(@idcol),
                        filter:=coalesce("time",0) > 0,
                        order_by:=eval(@idcol)),
with_variable('grpheadarr',
              array_filter(@idarr,
                           array_contains(@idarr,@element-1) is false),
with_variable('grparr',

               array_foreach(@grpheadarr,
                             array_slice(@idarr,
                                         array_find(@idarr,
                                                    @element),
                                         coalesce(array_find(@idarr,
                                                             array_get(@grpheadarr,@counter+1)
                                                             ) - 1,
                                                  -1)
                                         )
                             ),
-- Look up current feature's ID column value and get matching sub-array, then count of sub-array elements
-- e.g. current ID = 3, matching sub-array = [7,8,9,10], sub-array count = 4
-- e.g. current ID = 16, matching sub-array = [16,17], sub-array count = 2
array_length(
           array_filter(@grparr,
                        array_contains(@element,eval(@idcol))
                        )[0])
))))

Results

Using $id

Using explicit ID column

rawid has the original, implicit feature order, shown here to prove that the explicit ID column (newid) is independent

Wow, this is definitely a level above what i'm used to. I ended up having a colleague write a script for me, like you said Qgis isn't suited for this kind of work. Thanks for the thorough and impressive answer. — Simon Iversen, Mar 20 '24 at 12:15
import pandas as pd
Creating dataframe

df = pd.DataFrame([1,0,0,1,1,1,1,1,0,0,0,0,1,0,1,0,1,0,1,1,1,0,1] ,columns=["value"] )

Diff cummulative sum of not equal 0

df['ne0_cumsum'] = df["value"].diff().ne(0).cumsum()

Group by ne0_cumsum

df_groupby = df.groupby(['ne0_cumsum']).count()

Print answers

print("Mean:", df_groupby["value"].mean()) print("Min:", df_groupby["value"].min()) print("Max:", df_groupby["value"].max()) — Simon Iversen, Mar 20 '24 at 12:19

Find the longest unbroken and identical chain of a value in field

1 Answers1

Determine data order

Scenario 1 - Explicit ID column

Scenario 2 - Default data order

Apply expressions

Generate rank in group

Generate group sum

Results

Using $id

Using explicit ID column

Creating dataframe

Diff cummulative sum of not equal 0

Group by ne0_cumsum

Print answers