2

Taking the union* of these two large shapefiles file1 file2 locally using ArcMap takes appr. 10 min.

Over the last few weeks, I've tried to obtain the same results using other, open-source approaches without success. I am curious why only ArcMap seems to generate results in a reasonable amount of time without errors.

The approaches I've tried so far:

QGIS

The most straight-forward alternative. There are two "union" functions available in QGIS. One default (part of vector overlay), one based on SAGA. Both functions create the desired results on smaller samples but fail to run (or take forever) on my shapefiles. I tried running union for 30 minutes after which the progress bar was at appr. 4%

enter image description here

PostGIS

I've ingested the shapefiles on an AWS RDS instance running postGreSQL with PostGIS enabled. The approach is not so straight forward and requires a hideous long SQL query to obtain the same result. Most importantly, the query did not create a meaningful result on the full datasets. It did run properly on a sample though.

See separate question here

CREATE TABLE test.sqlislelijk AS 
-- input data
with polys1 AS (
  SELECT 
    pfaf_id as df1,
    geom as g
  FROM hybas06_v04
),
polys2 AS (
  SELECT 
    aqid as df2,
    geom as g
  FROM y2018m11d14_rh_whymap_to_rds_v01_v01
),
-- intersections
intersections AS (
  SELECT df1, df2, ST_INTERSECTION(a.g, b.g) i, a.g AS g1, b.g AS g2 
  FROM polys1 a, polys2 b WHERE ST_INTERSECTS(a.g, b.g)
),
-- per-row union of intersections with this row
diff1 AS (
  SELECT df1, ST_UNION(i) i FROM intersections GROUP BY df1
),
diff2 AS (
  SELECT df2, ST_UNION(i) i FROM intersections GROUP BY df2
),
-- various combinations of intersections
pairs AS (
  SELECT df1, df2, i AS g FROM intersections
  UNION ALL
  SELECT 
    p.df1,
    NULL,
    CASE
      WHEN i IS NULL THEN g 
      ELSE ST_DIFFERENCE(g, i)
    END
  FROM polys1 p LEFT JOIN diff1 d ON p.df1 = d.df1
  UNION ALL
  SELECT
    NULL,
    p.df2,
    CASE
      WHEN i IS NULL THEN g
      ELSE ST_DIFFERENCE(g, i)
    END
  FROM polys2 p LEFT JOIN diff2 d ON p.df2 = d.df2  
)
SELECT 
  df1 as pfaf_id,
  df2 as aqid,
  g as geom
FROM pairs WHERE NOT ST_IsEmpty(g);

Google BigQuery

Google's GIS extension for BQ looks promising but I experienced weird behavior and internal errors, probably due to mixing of GEOMETRY and GEOGRAPHY objects. Also, the syntax is similar to PostGIS and a command of 1 line in geopandas takes >20 lines in SQL.

GeoPandas

Probably my favorite approach. The out-of the box approach does't give any results after running it for several hours. My latest approach includes tiling the shapefiles, use multiprocessing on each tile and merge them. Not very straight forward and one 10x10 degree tile till takes 40+ minutes to calculate.

See Overlay Union Geopandas improve performance

It is the first time in my career that I am impressed by the performance of ArcMap. It seems that ArcMap handles this use-case much, much better than anything else. The issue seems persistent over all solutions I've tried so far. Is there something wrong in the underlying libraries of the open-source/alternative solutions?

  • Note that Union in ArcMap means something different than in the Shapely / PostGIS world.
PolyGeo
  • 65,136
  • 29
  • 109
  • 338
RutgerH
  • 3,285
  • 3
  • 27
  • 55
  • 1
    Let's stick to Union only to keep the question specific – RutgerH Nov 30 '18 at 10:54
  • 1
    Which QGis did you use? Did you convert to geopackage? – Erik Nov 30 '18 at 10:55
  • I've added the details for QGIS. I tried both approaches: Input GPKG, output GPKG and Input SHP Output SHP. Would you expect any difference in performance? – RutgerH Nov 30 '18 at 10:58
  • 1
    Could you clarify what you mean by different in Postgis -- your image above looks a lot like ST_Union to me. Also, you say the that it involves a hideous long query, without stating what that was. Off the bat, I would regard that as an extremely straightforward query. – John Powell Nov 30 '18 at 10:59
  • sure, I've also asked that question here: https://gis.stackexchange.com/questions/302433/seeking-postgis-equivalent-to-geopandas-union-overlay/303201#303201 The difference is that the output is not dissolved. https://desktop.arcgis.com/en/arcmap/10.3/tools/coverage-toolbox/dissolve.htm – RutgerH Nov 30 '18 at 11:01
  • 1
    I have also has a look at BQ spatial. And while I am encouraged that Google is pushing it, it only supports Geography and a very small function set, so far now, not that useful (to me, anyway). GeoMesa is another possibility, but, I suspect it is possible to do this with one of the other approaches. – John Powell Nov 30 '18 at 11:01
  • 1
    OK, it might be good to include the link in an edit to that question (which I remember as a good one that I didn't have time to take on), as talking about performance without giving clarity on the underlying query is somewhat misleading. – John Powell Nov 30 '18 at 11:04
  • @JohnPowell not sure if I understand you correctly. What would you like me to add ? – RutgerH Nov 30 '18 at 11:07
  • Maybe a link to the previous question, explaining the query. But I see you have posted the query, itself, and agreed, it is hideous. – John Powell Nov 30 '18 at 11:13
  • At least one of the layers contains faulty geometries. – Erik Nov 30 '18 at 11:14
  • 3
    Try a newer qgis -- from your screenshot you are still on 2, but these algorithms were much improved in 3 – ndawson Nov 30 '18 at 11:38
  • @Erik ArcMap does seem to handle the geomerties just fine – RutgerH Nov 30 '18 at 11:56
  • 2
    Have you tuned PostGIS for PostGIS and have you used spatial indexes and clustering? – MappaGnosis Nov 30 '18 at 11:58
  • spatial indexes: Yes clustering: Probably not. Any suggestion where I can learn more about that? now reading: https://postgis.net/docs/performance_tips.html – RutgerH Nov 30 '18 at 12:02
  • See answer to https://gis.stackexchange.com/questions/301598/only-union-dissolve-intersecting-features-to-save-time, example of clustering. Made a huge time difference for me. (I know its not the same Union) – BERA Nov 30 '18 at 12:16
  • 1
    Are you sure that the ArcMap union gives exactly the same result as the GeoPandas or QGIS one? (when evaluating all options on a smaller area, it should be relatively easy to compare) – joris Nov 30 '18 at 12:31
  • 2
    This is more of a discussion topic than question for the GIS SE's Q&A model. Keep in mind that the forerunner to ArcGIS was created once the code for topological overlay proved reliable; Esri has had a lot of time to refine and optimize. – Vince Nov 30 '18 at 12:35
  • @Vince where would you suggest to continue this discussion. I think there is value for the GIS community – RutgerH Nov 30 '18 at 12:44
  • 2
    That, of course, is the problem. While GIS SE has a [chat] feature, it is underutilized. Folks come to GIS SE for the "no chit-chat", so it's not particularly surprising that the chat function resembles the fairgrounds a month after the fair. – Vince Nov 30 '18 at 13:19
  • Have you tried GRASS directly or via QGIS Processing? I remember long ago when running overlays in ArcGIS with large datasets that I often could only get the overlays to complete by using coverages. GRASS topology reminds me of coverages Back the old days ESRI suggested the user should consider tiling the inputs and eventually they did that well in the tools themselves. – John Nov 30 '18 at 13:22

0 Answers0