Using spatial index to intersect points with polygon, when points and polygon have same minimum bounding box?

Question

I have a MultiPolygon representing the city boundary of Houston and it has an extremely complicated boundary. I also have a set of ~900,000 points (that has the same minimum bounding box as that Houston polygon). About ~400,000 of these points are within the polygon but the others lie outside it. Using python, geopandas, and shapely I tried intersecting this polygon with my points using r-tree. But because the points and polygon have the same minimum bounding box, r-tree offers no speed-up. The process currently takes 30+ minutes.

Which (if any) type of spatial index can I use to accelerate my intersection query when the polygon and points have the same minimum bounding box?

Edit to add code snippet here:

sindex = gdf['geometry'].sindex
possible_matches_index = list(sindex.intersection(polygon.bounds))
possible_matches = gdf.iloc[possible_matches_index]
points_in_polygon = possible_matches[possible_matches.intersects(polygon)]

I do not quite understand. Bounding box of a point is a point, isn't it? — user30184, Sep 26 '16 at 15:59
The two bounding boxes in question are 1) the minimum bounding box of the polygon, and 2) the minimum bounding box of a set of 900,000 points. — eos, Sep 26 '16 at 16:02
But your query compares bbox of each point with bbox of the polygon one by one. At least I hope so. It may still be not selective if the points outside the polygon are not outside the bbox of the polygon. Is that your case? — user30184, Sep 26 '16 at 16:03
I believe you want to do this http://www.gaia-gis.it/spatialite-3.0.0-BETA1/WorldBorders.pdf. Split your massive multipolygon to small polygons with few vertices and your spatial index will be selective and rock. — user30184, Sep 26 '16 at 16:15
Yes, basically the spatial index doesn't help because none of the points are outside the bounding box of the polygon. I previously tried what you suggested and divided my polygon into 1,000 sub-polygons then did a fast r-tree intersect to get possible matches. Then I intersected the possible matches with the full polygon geometry to get the actual precise matches. This was the fastest method I could come up with, but it really only yields speed-ups of 2x-3x (almost entirely because that intersection of possible matches with with full geom is still slow). I was hoping for like 100x speed ups. — eos, Sep 26 '16 at 17:12
Forget the full geometry. If you have rectangular pieces you will get exactly one candidate. Not much to check. — user30184, Sep 26 '16 at 17:32
I need to intersect the possible matches with the full geometry to get the precise, actual matches. I only get possible matches with the r-tree index, and I need the final product to be the exact points within my polygon(s) without any false positives or false negatives. — eos, Sep 26 '16 at 17:38
Believe me, you do not need full geometry for exact results if you have accurately clipped parts. Read the SpatiaLite example carefully. May need some thinking to make the same with shapely without having an interim table. — user30184, Sep 26 '16 at 18:09
"R-tree intersection considers bounding boxes only, so it generates false positives" source. Is this not correct? It certainly seems to be the case when I look at how the r-tree algorithm works. In other words, the sub-polygons to points r-tree intersections use the sub-polygons' bounding boxes, creating some false positive matches. So you then must run through them to reduce the possible matches to only those that precisely intersect. — eos, Sep 26 '16 at 18:45
Also, this comment on another thread: 'The R-tree allows you to make a very fast first pass and gives you a set of results that will have "false positives" (bounding boxes may intersect when the geometries precisely do not). Then you go over the set of candidates (fetching them from the shapefile by their index) and do a mathematically precise intersection test using, e.g., Shapely. This is the very same strategy that's employed in spatial databases like PostGIS.' source — eos, Sep 26 '16 at 18:56
Yes, your sources are right. You have discovered yourself why your query is slow and it due to 1) spatial index with your geometries is not selective and 2) computing the result of Intersects is expensive with your extremely complicated boundary. The split-by-grid trick makes spatial index more selective because great deal of the 500000 points which are outside won't be selected at all from the R-Tree. More important is that Intersects is fast because it is performed against more simple polygons which most often are simple rectangles. — user30184, Sep 26 '16 at 19:50

score 1 · Accepted Answer · answered Sep 26 '16 at 19:37

I try to explain with an image. If you do not agree that this strategy works I will delete my answer. However, I will continue to use it in my own work.

I took Texas as an example and I split it into 1 by 1 degree pieces. The grid that shows behind corresponds with the R-Tree index despite at the boundaries where the real R-Tree BBOXes are smaller.

In the Texas case a point never hits more than one R-Tree box. In case of adjacent polygons there may be more hits, but never more than a handful. Next thing to do is to make an Intersects test with a true geometry of the corresponding piece. If point is inside that piece it is inside Texas as well, and in the USA, North America etc.

Yes that image helps explain the logic. I will try to implement something like this in my python project, seems like it should offer considerable speed-up. — eos, Sep 27 '16 at 22:34

Using spatial index to intersect points with polygon, when points and polygon have same minimum bounding box?

1 Answers1

Linked