
I have a largeish number of points (slightly less than 500,000) and need to compute the distance from each one to the nearest object in a collection of geometries (usually a bunch of polygons, but in one case a bunch of lines). My current approach is to compute the union of the target geometries into a big multipolygon/multiline, put the points in a GeoPandas dataframe and use the df.geometry.distance() method to compute the distance for each point. It works, but it's ridiculously slow, so I suspect I'm going about this the wrong way.

Given my problem, what's the correct way of computing the distance between each point and the collection of geometries? An efficient solution in Python would be great, but dropping down to raw GDAL in C/C++ isn't a problem either if that's what it takes.

The largest of the polygon collections has about 80,000 polygons in it, while the line collection has about 750,000 lines in it. In full, my current approach is:

import geopandas

points = geopandas.read_file('points.shp')
geometries = geopandas.read_file('collection.shp')
big_geometry = geometries.geometry.unary_union
points['distance_to_geometries'] = points.geometry.distance(big_geometry)

For context, the points are the coordinates of buildings and the polygons and lines are various natural hazard things (rivers, flood zones, landslide zones, and the like), all within Norway.

arnsholt
  • If you can store all your geometries in a PostGIS database, then you would have access to ST_Distance or ST_3DDistance. This could be another solution. – swiss_knight Sep 15 '21 at 18:07
  • The polygon collections are in the range 500-80000 polygons, the line data set is bigger at about 750000 lines. The computation now is basically big = collection_df.geometry.unary_union; dists = points.geometry.distance(big), because it was the dead easy way to write it. – arnsholt Sep 15 '21 at 18:17
  • If I leave the polygon/line collections as is, maybe I could use some kind of spatial indexing technique to narrow down the list of candidates to measure against and speed it up that way? (See the sketch after these comments.) – arnsholt Sep 15 '21 at 19:32
  • It would really help to provide some code which creates randomized points, polygons, and lines similar to your problem. – Shawn Oct 07 '21 at 01:31
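
A minimal sketch of the spatial-indexing idea from the comment above, assuming Shapely >= 2.0 (where STRtree.nearest accepts an array of geometries and returns integer indices into the tree) and a GeoPandas version built on top of it:

import geopandas
from shapely import STRtree, distance  # Shapely >= 2.0

points = geopandas.read_file('points.shp')
geometries = geopandas.read_file('collection.shp')

# Index the individual geometries instead of unioning them into one big geometry.
tree = STRtree(geometries.geometry.values)

# One vectorised query: index of the nearest tree geometry for each point.
nearest_idx = tree.nearest(points.geometry.values)

# Measure each point only against its single nearest geometry.
points['distance_to_geometries'] = distance(
    points.geometry.values,
    geometries.geometry.values[nearest_idx],
)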

1 Answer

Try sjoin_nearest to join each point to the nearest geometry in the other dataframe, then measure the distance to it:

import geopandas as gpd
from timeit import default_timer

#pointdf, polydf and linedf are GeoDataFrames of the points, polygons and lines
print(pointdf.shape)
#(328867, 2)
print(polydf.shape)
#(40363, 2)
print(linedf.shape)
#(942947, 2)

#Measure distances from each point to nearest polygon
start = default_timer()
polydf["geombackup"] = polydf.geometry
sj = gpd.sjoin_nearest(left_df=pointdf, right_df=polydf, how="left")
sj["distance"] = sj.apply(lambda x: x.geometry.distance(x["geombackup"]), axis=1)
print(default_timer()-start)
#22.3

#Each point to nearest line
start = default_timer()
linedf["geombackup"] = linedf.geometry
sj = gpd.sjoin_nearest(left_df=pointdf, right_df=linedf, how="left")
sj["distance"] = sj.apply(lambda x: x.geometry.distance(x["geombackup"]), axis=1)
print(default_timer()-start)
#6.5
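
If your version of GeoPandas supports the distance_col parameter of sjoin_nearest, the join can also compute the distance directly, which skips the geombackup column and the row-wise apply:

sj = gpd.sjoin_nearest(left_df=pointdf, right_df=polydf, how="left", distance_col="distance")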

Then you can merge the distance column back to the points.
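
For example, a minimal sketch of that merge, assuming sjoin_nearest has kept the index of pointdf (the min() guards against a point having several equidistant nearest matches):

pointdf["distance"] = sj.groupby(sj.index)["distance"].min()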

BERA