geopandas spatial join extremely slow

Question

I am using the code below to find a country (and sometimes state) for millions of GPS points. The code currently takes about one second per point, which is incredibly slow. The shapefile is 6 MB.

I read that geopandas uses rtrees for spatial joins, making them incredibly efficient, but this does not seem to work here. What am I doing wrong? I was hoping for a thousand points per second or so.

The shapefile and csv can be downloaded here (5MB): https://www.dropbox.com/s/gdkxtpqupj0sidm/SpatialJoin.zip?dl=0

import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame, read_file
from geopandas.tools import sjoin
from shapely.geometry import Point, mapping,shape
import time


#parameters
shapefile="K:/.../Shapefiles/Used/World.shp"
df=pd.read_csv("K:/.../output2.csv",index_col=None,nrows=20)# Limit to 20 rows for testing    

if __name__=="__main__":
    start=time.time()
    df['geometry'] = df.apply(lambda z: Point(z.Longitude, z.Latitude), axis=1)
    PointsGeodataframe = gpd.GeoDataFrame(df)
    PolygonsGeodataframe = gpd.GeoDataFrame.from_file(shapefile)
    PointsGeodataframe.crs = PolygonsGeodataframe.crs
    print time.time()-start
    merged=sjoin(PointsGeodataframe, PolygonsGeodataframe, how='left')
    print time.time()-start
    merged.to_csv("K:/01. Personal/04. Models/10. Location/output.csv",index=None)
    print time.time()-start

score 21 · Accepted Answer · edited Jan 05 '22 at 17:34

21

adding the argument predicate='within' in the sjoin function speeds up the point-in-polygon operation dramatically.

Default value is predicate='intersects', which I guess would also lead to correct result, but is 100 to 1000 times slower.

edited Jan 05 '22 at 17:34

gcamargo

188
1
7

answered Dec 30 '15 at 13:28

Alexis Eggermont

943
1
10
18

5

To anyone reading this, this does not mean that within is generally somehow faster, read nick_g's answer below. – inc42 Nov 30 '19 at 09:15

score 17 · Answer 2 · edited Jan 05 '22 at 17:34

17

What's likely going on here is that only the dataframe on the right is fed into the rtree index: https://github.com/geopandas/geopandas/blob/master/geopandas/tools/sjoin.py#L48-L55 Which for an predicate="intersects" run would mean the Polygon was fed into the index, so for every point, the corresponding polygon is found through the rtree index.

But for predicate="within", the geodataframes are flipped since the operation is actually the inverse of contains: https://github.com/geopandas/geopandas/blob/master/geopandas/tools/sjoin.py#L41-L43

So what happened when you switched the predicate from predicate="intersects" to predicate="within" is that for every polygon, the corresponding points are found through the rtree index, which in your case sped up the query.

edited Jan 05 '22 at 17:34

gcamargo

188
1
7

answered Oct 27 '16 at 16:02

nick_g

463
4
8

1

You used non-permanent URLs, could you update them to a specific revision maybe? – inc42 Nov 30 '19 at 09:08
link to most recent release with the issue persisting: https://github.com/geopandas/geopandas/blob/v0.12.2/geopandas/tools/sjoin.py#L216 – camdenl Dec 27 '22 at 18:35

score 13 · Answer 3 · answered Oct 27 '16 at 00:18

The question asks how to take advantage of r-tree in geopandas spatial joins, and another responder correctly points out that you should use 'within' instead of 'intersects'. However, you can also take advantage of an r-tree spatial index in geopandas while using intersects/intersection, as demonstrated in this geopandas r-tree tutorial:

spatial_index = gdf.sindex
possible_matches_index = list(spatial_index.intersection(polygon.bounds))
possible_matches = gdf.iloc[possible_matches_index]
precise_matches = possible_matches[possible_matches.intersects(polygon)]

camdenl · Answer 4 · 2022-12-27T19:04:12.413

0

Related to the issue in geopandas discussed in this answer and seen here in the latest release:

If you have two geodataframes and want to perform an intersects operation and you know which spatial index should be queried, you can join from small to large and use how="right" to force this index to be used:

small_gdf.sjoin(large_gdf, predicate="intersects", how="right")

edited Dec 27 '22 at 19:04

answered Dec 27 '22 at 18:40

camdenl

1,213
2
12
27

geopandas spatial join extremely slow

4 Answers4

Linked