7

How to speed up the "union" of two shapefiles using geopandas?

I have two large shapefiles:

geodataframe_left.shape = (3610, 12) (500 MB) (link)

geodataframe_right.shape = (16396, 3) (200 MB) (link)

One file contains global province boundaries (GADM), The other shapefile contains Hydrobasin level 6. I've tried multiple approaches (PostGIS, Google Bigquery, ArcGIS) but nothing beats geopandas' one line command:

gdf_union = gpd.overlay(gdf_left, gdf_right, how='union')

there is however one problem. The command is extremely slow on larger datasets including mine. I've waited for almost an hour now without success. I'm running my code on a large amazon EC2 instance (32GB RAM).

How can I improve the performance of this operation? Is it possible to use multiprocessing for example or can I split the problem into smaller chunks so I can monitor progress?

UPDATE:

  • I tried updating to geopandas 0.4.0 .. still no result after running for two hours.
  • Used the use_sindex = True flag. Deprecated since 0.4.0 (always uses sindex)
  • I implemented the tile approach and found weird behavior. Details below.

I created a fishnet grid of polygons of 10 x 10 degrees that I use to clip both layers before performing the union. The processing times per cell differ wildly and there are 4 cells that take > 10 minutes to process. One cell takes even 40 minutes.

When the process runs in parallel I get the following error (script continues though):

AttributeError: 'IndexStreamHandle' object has no attribute '_ptr'
Exception ignored in: <bound method Handle.__del__ of <rtree.index.IndexStreamHandle object at 0x7f007cb42e10>>
Traceback (most recent call last):
  File "/opt/anaconda3/envs/python35/lib/python3.5/site-packages/rtree/index.py", line 875, in __del__
    self.destroy()
  File "/opt/anaconda3/envs/python35/lib/python3.5/site-packages/rtree/index.py", line 863, in destroy
    if self._ptr is not None:

enter image description here

What's interesting is that these cells are at the 180 meridian. I've uploaded the GPKG here

enter image description here

Maybe shapely doesn't like this hemisphere crossing stuff.

UPDATE 2:

Running ArcMap locally, the process takes appr. 10 minutes. Instead of using geopackages, I used shapefiles. file1 file2

Executing: Union "hybas_lev06_v1c_merged_fiona_V04 #;gadm36_1 #" C:\Users\Rutger.Hofste\Desktop\werkmap\union_benchmark\output\union_arcgis_global.shp ALL # GAPS
Start Time: Fri Nov 30 10:32:04 2018
Reading Features...
Processing Tiles...
Assembling Tile Features...
Succeeded at Fri Nov 30 10:43:02 2018 (Elapsed Time: 10 minutes 57 seconds)

I also tried "Union" (both default and SAGA) in QGIS but this process is also ridiculously slow. I ran for 30min and the progress bar was at 4%.

UPDATE 3: I'm implementing the tiled approach however for a small number of polygons the result of an intersection in shapely is not a multipologon or polygon but a geometrycollection with multiple geometries including lineStrings or just LineStrings. I would expect the result of an intersection to be polygon or multipolygon.

RutgerH
  • 3,285
  • 3
  • 27
  • 55
  • This seems to be very relevant: https://jorisvandenbossche.github.io/blog/2017/09/19/geopandas-cython/ – RutgerH Nov 28 '18 at 16:15
  • I can't run your code with the provided data without this error: AttributeError: 'MultiPolygon' object has no attribute 'exterior' – Jon Nov 28 '18 at 16:22
  • interesting. In my script I'm actually querying the data from a postGIS database (that was my initial approach). The data should be very similar though. Maybe you can try to run the union with "valid" geometries only. – RutgerH Nov 28 '18 at 16:29
  • 1
    You can clip both layers into Tiles, Union the Tiles, then merge the results. More steps but can be run in-parallel. – klewis Nov 28 '18 at 18:00
  • Which version of GeoPandas are you using? The overlay function got a revamp in 0.4.0, so make sure to use that (although with such "big" data, I can still imagine it being slow at the moment. This would be a nice use case to see if we can further speed it up) – joris Nov 28 '18 at 22:51
  • currently I have 0.3.0 Let me update geopandas and try again. – RutgerH Nov 29 '18 at 09:02
  • No results after running the script for 2 hours and geopandas 0.4.0 Will try the tiled approach this afternoon. – RutgerH Nov 29 '18 at 12:04
  • would you suggest to install the Cythonized version of geopandas to speed this up? – RutgerH Nov 29 '18 at 15:01
  • I followed @klewis approach but after I merge the results the results, I need to dissolve the geometries based on unique identifiers. Even this dissolve step takes forever. – RutgerH Nov 29 '18 at 15:44
  • 1
    The .overlay command has an optional 5th parameter, use_sindex, have you tried this? – klewis Nov 29 '18 at 16:06
  • 1
    @RutgerH I am afraid that the cythonized version of geopandas will not help much. I did a timing with a small subset of the data (subset in Europe, one of the parts that take long), and ~90% of the time is spent inside GEOS itself, the underlying C library that does the union/intersection/difference operations. – joris Nov 29 '18 at 23:10
  • 1
    @klewis use_sindex is deprecated in the 0.4 version of geopandas, as the new implementation of overlay always uses a spatial index. – joris Nov 29 '18 at 23:12
  • @RutgerH Depending the requirements of your application, some ideas: 1) do you need the high resolution? (if not, if the geometries could be simplified this would speed up the overlay operation) 2) Would a 'intersection' overlay also suffice, instead of a 'union' overlay (it gives rougly the same result, apart from the areas near the boundaries of the continents that do not overlap, and it quite a bit faster from a quick test). Further, you mention other systems in your post. Did you eg try QGIS? Is it faster there? – joris Nov 29 '18 at 23:18
  • thank you for the comment @joris For my application it is important to follow coastlines and political boundaries at a high resolution. Therefore I'd rather not simplify nor use intersect. I've also performed this operation in ArcMap (appr. 10 min) and QGIS 2 (20 min) although my goal is to have "union" in the ArcMap interpretation. – RutgerH Nov 30 '18 at 08:17
  • I find it intriguing why QGIS would be so much faster, as they seem to use the same algorithm logic (and also rely on the same GEOS library for some of the operations). Something to investigate! – joris Nov 30 '18 at 08:38
  • I've added the arcMap benchmark to the question. Contrary to what I said earlier, union in QGIS with these shapefiles is ridiculously slow as well. I ran the process for 30min now and the progress bar is only at 4%. This might be a more fundamental issue in GEOS. – RutgerH Nov 30 '18 at 10:12
  • I've asked another broader question. The performance issue seems persistent over multiple approaches inluding, QGIS, GeoPandas, PostGIS etc. https://gis.stackexchange.com/questions/304516/why-is-union-in-arcmap-much-faster-than-other-approaches – RutgerH Nov 30 '18 at 10:51
  • I've found a pretty fundamental issue with Shapely which is GEOS based. I've documented my findings here: https://gis.stackexchange.com/questions/304712/no-output-of-simple-intersection-in-qgis-shapely – RutgerH Dec 03 '18 at 12:39

0 Answers0