1

I'm working with large datasets regarding the concentration of pollutants in rivers. When the field "time" = 1 then the concentration has exceeded a certain limit for 10 mins.

I need to know the longest (and average if possible) continuous period of exceedance.

The field "example" below shows a manual result of what I'm trying to get with the field calculator.

enter image description here

GforGIS
  • 3,126
  • 3
  • 19
  • 38
  • 1
    A bit more detail would help. To help me understand, I assume another column on the left would be indication time interval? Is some such id column or the like tucked away in the data? I'd say that would make the analysis easier. And is the order of rows/records in the table actually the same as the data? There is a function in the calculator 'row_number' but how that works will depend on the native arrangement of the table. – BJW Mar 16 '24 at 20:46
  • I think this answer is relevant - https://gis.stackexchange.com/a/162015/135109 I would copy your data out of the attribute table (or export layer as xlsx without geometry) into a spreadsheet and do it there. – BJW Mar 16 '24 at 21:01

1 Answers1

2

You can try the following convoluted expressions in Field Calculator, but please note: it's likely to be very slow with large datasets (ref this post on GIS.SE) as it will do a full table scan.

Field Calculator isn't well suited to this kind of windowed aggregation, either a Python (PyQGIS or geopandas) script or SQL via Virtual Layer would be more efficient.


Determine data order

First off, as mentioned in the comments to your question, we need to know how your data is ordered so that we can determine consecutive records to cumulatively sum.

Scenario 1 - Explicit ID column

  • For the expressions below the ID column must have unique values and be sequential (no gaps).
  • If you have the data sorted by a column(s) that does not follow the above, you could create the ID column using other methods first (examples here).

Scenario 2 - Default data order

  • Alternatively, if the default row order in your file/database is as you need it to be (e.g. in your screenshot), we could directly use the innate feature ID value stored as $id
    • However, be absolutely sure this is the original sort order.
    • Safest thing to do is create a new/virtual field with $id and use that instead per scenario 1.
    • Otherwise, right click column headings in Attribute Table > Sort > uncheck 'Defined sort order in attribute table', then check to see that the data is ordered as required.

Apply expressions

You can now use the expressions below. Replace the second line with the string representation of the name of the ID column representing data order as follows:

  • Explicit ID column: Column name (with double quotation marks if required - e.g. space in column name), wrapped in single quotation marks outside.
    • e.g. 'newid', '"new id"'
  • Implicit $id variable: Wrapped in single quotation marks only. Do not use double quotation marks.
    • e.g. '$id'

Generate rank in group

This will generate a sequential number per "group" as in your screenshot.

with_variable('idcol',
              '"newid"',     -- Replace NAME of ID column here, inside single quotes. Inner double quotes as required. e.g. 'newid', '"new id"', '$id'

-- Generate array of ID values where "time" is recorded, sorted by same ID column. -- e.g. 2, 3, 4, 7, 8, 9, 10, 13, 16, 17

with_variable('idarr', array_agg(eval(@idcol), filter:=coalesce("time",0) > 0, order_by:=eval(@idcol)),

-- Get ID values from previous array which are the first in the "group" -- i.e. first after each gap in sequence (= "time" was present in that record but not the one preceding it). -- e.g. 2, 7, 13, 16

with_variable('grpheadarr', array_filter(@idarr, array_contains(@idarr,@element-1) is false),

-- "Group" all idarr values into their own sub-arrays based on group head defined in previous step. -- e.g. [2, 3, 4], [7, 8, 9, 10], [13], [16, 17]

with_variable('grparr',
array_foreach(@grpheadarr, array_slice(@idarr, array_find(@idarr, @element), coalesce(array_find(@idarr, array_get(@grpheadarr,@counter+1) ) - 1, -1) ) ),

-- Look up current feature's ID column value and get matching sub-array, then position in sub-array, then +1 as array position starts from 0. -- e.g. current ID = 3, matching sub-array = [7,8,9,10], rank position in sub-array = 3 (2 + 1) -- e.g. current ID = 16, matching sub-array = [16,17], rank position in sub-array = 1 (0 + 1)

array_find( array_filter(@grparr, array_contains(@element,eval(@idcol)) )[0], eval(@idcol) ) + 1

))))

Generate group sum

This will simply calculate the number of sequential records in each "group" so you can directly derive the longest period. Same as above except last section.

with_variable('idcol',
              '"newid"',     -- Replace NAME of ID column here, inside single quotes. Inner double quotes as required. e.g. 'newid', '"new id"', '$id'

with_variable('idarr', array_agg(eval(@idcol), filter:=coalesce("time",0) > 0, order_by:=eval(@idcol)),

with_variable('grpheadarr', array_filter(@idarr, array_contains(@idarr,@element-1) is false),

with_variable('grparr',
array_foreach(@grpheadarr, array_slice(@idarr, array_find(@idarr, @element), coalesce(array_find(@idarr, array_get(@grpheadarr,@counter+1) ) - 1, -1) ) ),

-- Look up current feature's ID column value and get matching sub-array, then count of sub-array elements -- e.g. current ID = 3, matching sub-array = [7,8,9,10], sub-array count = 4 -- e.g. current ID = 16, matching sub-array = [16,17], sub-array count = 2

array_length( array_filter(@grparr, array_contains(@element,eval(@idcol)) )[0])

))))


Results

Using $id

enter image description here

Using explicit ID column

  • rawid has the original, implicit feature order, shown here to prove that the explicit ID column (newid) is independent

enter image description here

she_weeds
  • 12,488
  • 1
  • 28
  • 58
  • Wow, this is definitely a level above what i'm used to. I ended up having a colleague write a script for me, like you said Qgis isn't suited for this kind of work. Thanks for the thorough and impressive answer. – Simon Iversen Mar 20 '24 at 12:15
  • import pandas as pd

    Creating dataframe

    df = pd.DataFrame([1,0,0,1,1,1,1,1,0,0,0,0,1,0,1,0,1,0,1,1,1,0,1] ,columns=["value"] )

    Diff cummulative sum of not equal 0

    df['ne0_cumsum'] = df["value"].diff().ne(0).cumsum()

    Group by ne0_cumsum

    df_groupby = df.groupby(['ne0_cumsum']).count()

    Print answers

    print("Mean:", df_groupby["value"].mean()) print("Min:", df_groupby["value"].min()) print("Max:", df_groupby["value"].max())

    – Simon Iversen Mar 20 '24 at 12:19