postgis recursive query to spatially select stream network connectivity

Question

I have a python class that takes a polyline streams layer, selects 1 record and tries to find the network connectivity by using attribute and spatial methods. For this question I will only focus on the spatial class method. I want to convert the spatial class method into a postgis query.

Here is the python class for reference:

import arcpy
arcpy.env.workspace=r'path_gdb'
arcpy.env.overwriteOutput=True

class Stream_Connection():
    def __init__(self,stream_layer,stream_name,query_name,where_name):
        self.stream_layer=stream_layer #polyline streams layer
        self.stream_name=stream_name #stream name to call FC
        self.query_name=query_name   #isolate stream record to start
        self.where_name=where_name   #query for all records that contain part of the name
        self.stream_network=None
    def attribute(self):
        '''this class method finds all the records that have part of the stream/river name
            in it and insert it into the self.stream_network layer'''
        query="SW_NAME = '{}'".format(self.query_name)
        where_clause_qry="SW_NAME LIKE '%{}%' ".format(self.where_name)
        self.stream_network=arcpy.Select_analysis(self.stream_layer,self.stream_name,query)
        with arcpy.da.InsertCursor(self.stream_network,("SW_NAME","SHAPE@")) as insert:
            with arcpy.da.SearchCursor(self.stream_layer,("SW_NAME","SHAPE@"),where_clause=where_clause_qry) as cur:
                for row in cur:
                    insert.insertRow(row)
    def spatial(self):
        '''this class method explodes the full polyline streams layer into singleparts
            and runs a while loop that finds all the streams that intersect the self.stream_network layer.
            those intersecting lines get added to the self.stream_network
            and continues until the number of records plateau and remain the same(no more features that intersect)'''
        streams_explode=arcpy.MultipartToSinglepart_management(self.stream_layer,r"in_memory\streams_boom")
        streams_explode_fl=arcpy.MakeFeatureLayer_management(streams_explode,'explode')
        previous=0
        while int(arcpy.GetCount_management(arcpy.MakeFeatureLayer_management(self.stream_network,'sn')).getOutput(0)) >0:
            nums= int(arcpy.GetCount_management(arcpy.MakeFeatureLayer_management(self.stream_network,'sn')).getOutput(0))
            streams_fl=arcpy.MakeFeatureLayer_management(self.stream_network,'sn')
            arcpy.SelectLayerByLocation_management(streams_explode_fl,"INTERSECT",streams_fl)
            #arcpy.SelectLayerByLocation_management(streams_explode_fl,"WITHIN_A_DISTANCE",streams_fl,"2 Feet","NEW_SELECTION")
            arcpy.Append_management(streams_explode_fl,streams_fl,schema_type="NO_TEST")
            arcpy.Select_analysis(streams_fl,'interm')
            arcpy.Dissolve_management('interm',self.stream_name,['SW_NAME'])
            print nums
            if previous == nums: break
            previous=nums

streams=Stream_Connection('streams','Musconetcong_Network','Musconetcong River','Musconetcong')
streams.attribute()
streams.spatial()

What the spatial class method does:

1.  Explodes the polyline streams layer into singleparts 
2.  Takes the record or records of streams selected in the attribute method(self.stream_network)
3.  Finds the singlepart streams that intersect the self.stream_network and append those records to it. This keeps running until there are no more new records getting added to the self.stream_network

Trying it out in postgis:

drop table if exists stream_connectivity;
create table stream_connectivity as
with RECURSIVE stream_connection as (
    select shape, sw_name 
        from streams where sw_name = 'Musconetcong River'
    union all
    select b.shape, b.sw_name
        from stream_connection a join streams_boom b 
    on st_intersects(a.shape,b.shape)
    )
select * from stream_connection

so from this tutorial http://www.postgresqltutorial.com/postgresql-recursive-query/ I have formulated this query. and from what I gather the first part of the CTE query selects 'Musconetcong River' record, then the second part in the CTE calls the first part and spatially joins the streams_boom layer. then those results are added to the 1st part.. and on and on and at some point during this recursive process if the results return zero then the query ends? This query has been running for 2+ days. The python script takes just 2 min. so I am just curious if I am formulating this correctly

UPDATE from comments and docs:

drop table if exists stream_connectivity;
create table stream_connectivity as
with RECURSIVE stream_connection(sw_name,shape,objectid) 
    as (
        select shape, sw_name,objectid,
           ARRAY[objectid] AS idlist
           from streams_boom where sw_name = 'Musconetcong River'
        union all
        select b.shape, b.sw_name,b.objectid,
           array_append(a.idlist,b.objectid) AS idlist
           from stream_connection a,streams_boom b 
           where st_intersects(a.shape,b.shape) 
           AND NOT a.idlist @> ARRAY[b.objectid]
    )
select objectid,sw_name,shape from stream_connection

FINAL QUERY THAT WORKED

I have correctly converted the entire python Stream_Connection class into a postgis query with the guidance of @ThingumaBob

*streamsx instead of streams_boom *query took 2 hours instead of 2 min :( so if anyone has any optimization recommendations please let me know

drop table if exists scrap.stream_connectivity;
create table scrap.stream_connectivity as
WITH
    segment AS (
        SELECT geom
        FROM scrap.streamsx
        WHERE sw_name = 'Musconetcong River'
    ),
    network AS (
        SELECT ST_Union(a.geom) AS geom,array_agg(id) ids
        FROM (
            SELECT ST_ClusterDBSCAN(geom, 0, 1) OVER() AS cluster_id,
                   geom,id
            FROM scrap.streamsx
        ) AS a
        GROUP BY cluster_id
    ),
sections as(SELECT (st_dump(a.geom)).geom geom,unnest(ids) id
        FROM network AS a
        JOIN segment AS b
            ON ST_dwithin(a.geom, b.geom,.01)
        ),
spatial_finished as(select a.geom,a.id,b.sw_name 
            from sections a join scrap.streamsx b
            on a.id=b.id
        ),
attribute as(select geom,id,sw_name 
        from scrap.streamsx 
        where sw_name like '%Musconetcong%'
        ),
distinct_ as(select distinct on(geom) geom,id,sw_name
        from spatial_finished
        union
         select distinct on(geom) geom,id,sw_name from attribute
         )
select st_union(geom) geom,sw_name
    from distinct_ group by sw_name

The problem with your query is that there is no stop condition, so you are in an infinite loop. Recursive queries are a good way to do graph search, but typically you start with an an array of your initial stream id and in the recursive part you add to this array and check that this path hasn't already been search. If you look at the graph_search example in the docs you will see how this works. — John Powell, May 02 '18 at 14:33
@JohnPowellakaBarça read the docs so essentially I need a stopping mechanism. The best way to accomplish this is to create an array of unique id's and to check if those ids are already in the array. The docs mention path but I am not exactly sure what it is referring to?(unique id?). I am posting an updated query from your advice and what the docs say — ziggy, May 02 '18 at 16:23
Path could be anything, it is just the name for an array of already searched segments and the ANY checks that you haven't already traversed that route and if so, sets cycle to true, which breaks out of the loop. — John Powell, May 02 '18 at 16:32
what you try there will only work if the result of the recursive part will always return only one record (I'm not sure if it will work even then)! as soon as there are two, it will go into loop. problem is, you cannot use aggregates in the recursive part to merge both the geoms and the arrays — geozelot, May 02 '18 at 16:58
@JohnPowellakaBarça any chance you could post an answer with the recursion technique using path and cycle. only columns I am concerned with are shape,sw_name and objectid (which is unique) — ziggy, May 02 '18 at 17:59
...actually, the query you used in your update is equivalent to the path and cycle example and should work fine on a linked graph...unfortunately, in your case, as soon as that query hits an intersection where the river forks (e.g. two or more rows returning), it will also loop. — geozelot, May 02 '18 at 18:25
so is there anyway I can tweak my updated recursive query? still running your query (w/ slight mods) 25 + min and counting... — ziggy, May 02 '18 at 18:33
I doubt it...you could better implement that function in pl/pgsql. for the cluster approach; if you aren't working with billions and billions of rows, sth. is wrong...how does that query look like? — geozelot, May 02 '18 at 19:08
I don't have time atm. You still don't have a stopping condition. The any(id) part sets the cycle to true, and this is what causes the looping to break. Sorry to be a pita, but I found with recursive queries, that I had to do them myself a few times till I got it. They are very, very non-trivial to grok. I disagree with Thingumabob about the intersection, as the new link will not have been traversed, even if the start node has. — John Powell, May 03 '18 at 08:30

geozelot · Accepted Answer · 2018-05-02T15:08:18.563

2

Does it have to be a recursive query? Those are a) rather complicated if not familiar with and b) rather inefficient for this specific task, at least in comparison to ST_ClusterDBSCAN.

Assuming streams_boom are your streams exploded into LINESTRINGs as geom with their name as name, running

SELECT name,
       ST_ClusterDBSCAN(geom, 0, 1) OVER(PARTITION BY name) AS conn_netw_id,
       geom
FROM streams_boom

would return your geometries with a conn_netw_id assigned to each connected sub-network grouped by name (note: there will be conn_netw_id's ranging from 0 - n for each name)

Update:

To get only that network connected to a pre-selected segment using ST_ClusterDBSCAN, you could do sth. like (not optimized):

WITH
    segment AS (
        SELECT geom
        FROM streams_boom
        WHERE <segment_id> = <segment_of_interest>
    ),
    network AS (
        SELECT ST_Union(a.geom) AS geom
        FROM (
            SELECT ST_ClusterDBSCAN(geom, 0, 1) OVER() AS cluster_id,
                   geom
            FROM streams_boom
            WHERE <river_system> = <river_system_of_interest>
        ) AS a
        GROUP BY cluster_id
    )

SELECT a.*
FROM network AS a
JOIN segment AS b
  ON ST_Intersects(a.geom, b.geom)

with <segment_id> being the uid of the segment of interest and <stream_system> being e.g. the name of the larger network (i.e. if you were looking for a sub-network connected to the segment with uid = 1 in the Amazonas, 'Amazonas' would be the <stream_system>; I used this to limit the clustering on the river system you are looking at)

edited May 02 '18 at 15:08

answered May 02 '18 at 14:03

geozelot

30,050
4
32
56

it can be any type of query. But how would I specify that I only want the records that intersect the 1 record I initially select, like in my python example. I cannot just add a where statement to the end of this query. where sw_name='Musconetcong_Network' with your example returns no records – ziggy May 02 '18 at 14:09
i would need some sort of cross join/self join to go along with ST_ClusterDBSCAN – ziggy May 02 '18 at 14:11
ah, sorry, didn't realize the segment selection...yes, you could run the clustering in a sub-query, union the results grouped by name and cluster_id and select the cluster that intersects with the segment you are looking for. give me a minute, Iĺl add to my answer... – geozelot May 02 '18 at 14:24
I'm not convinced that distance based clustering is the best way to do graph search. This seems like a good use case for a recursive query, albeit one that is not an infinite loop :D. It is possible that I have misunderstood the question, though. – John Powell May 02 '18 at 14:37
@JohnPowellakaBarça yeah there's overhead now I agree, and recursive queries do make sense here, but even with the above construct I get results in 200ms from a table with 250k rows; nothing beats clustering on intersection (distance = 0) on C level functions.... – geozelot May 02 '18 at 14:46
@ThingumaBob. Ah, yes, distance=0, does give the same as a graph search for connectivity. Clever. Recursive queries do my head in every time I look at them, though they have their uses, and, yes, they are generally quite slow, so if there is another way, it is better. +1 – John Powell May 02 '18 at 14:52
@ThingumaBob running this now over lunch, crossing my fingers. will update – ziggy May 02 '18 at 16:18
did not have a use for WHERE <river_system> = <river_system_of_interest> so the whole stream layer is getting put through the DBSCAN (which is what I want). I changed st_union to st_collect(which should speed it up) and added a column sw_name to the output. so its getting grouped by cluster_id and sw_name.. gonna let it run overnight we'll see what the deal is tomorrow – ziggy May 02 '18 at 20:03
update: query ran for 2+ hours but did not return what I wanted...I will take out the stream names(might have been the problem). but this query alone SELECT ST_ClusterDBSCAN(geom, 0, 1) OVER() AS cluster_id, geom FROM streams_boom has been running for over 30+ minutes. the table has 15k rows with proper spatial indexes – ziggy May 03 '18 at 13:45
explain analyze verbose of the above comment query "WindowAgg (cost=0.00..3859.12 rows=15694 width=3363) (actual time=5180908.441..5180959.210 rows=15694 loops=1)" " Output: st_clusterdbscan(geom, '0'::double precision, 1) OVER (?), geom" " -> Seq Scan on streams_boom (cost=0.00..3662.94 rows=15694 width=3363) (actual time=0.024..15.234 rows=15694 loops=1)" " Output: geom" "Planning time: 8.502 ms" "Execution time: 5180988.978 ms" – ziggy May 03 '18 at 17:27
@ziggy: go chat? – geozelot May 03 '18 at 18:20
so the ST_ClusterDBSCAN recommendation led me to finally come up with a good solution, which I will post in my question. Thanks for advice I appreciate it – ziggy May 08 '18 at 16:22

postgis recursive query to spatially select stream network connectivity

1 Answers1

Linked