1

I'm using a version of Solution 2 of the selected answer here to build a new shapefile of all points that fall within a given polygon object (i.e. a clip).

When run against 27K points, it takes about 9 minutes. In arcpy it takes < 10 seconds.

I cant find any documentation on the Clip_analysis() code (is it proprietary?) so I am curious as to how it can execute the checks of 27K so fast?

Is it multi-processed? Is it calling some type of map() on the iterable of points?

PolyGeo
  • 65,136
  • 29
  • 109
  • 338
pstatix
  • 426
  • 4
  • 13
  • All Esri code is proprietary, you will not find the source anywhere unless you are a developer working for Esri. However, I have found OGR2OGR -clipsrc to be even faster, especially with polylines. It is no surprise to me that assembly instructions (executables) are faster than python (an interpretative language), it just highlights the efficiency of low level code (C, C++, fortran, assembly) vs high level languages (python, VBA). You also have to be aware that Esri has been around since the 80s' and has optimized their base code through years of coding. – Michael Stimson Sep 05 '17 at 22:10
  • Very common reason for such big difference in speed is that the fast solution is using spatial index but the slow one makes Cartesian product. Differences between languages are usually nominal compared with good vs bad general logic. – user30184 Sep 06 '17 at 04:48
  • @user30184 difference between similar generation languages is usually nominal when you compare interpreter vs interpreter (4GL) but not so much when comparing compiled vs compiled (3GL). C++ performs better than C#/VB.net under CRT. The performance difference between python/Ruby/Perl and C++ is enormous; It's the price you pay for simplicity in code as more overhead is required to interpret and validate during run time. – Michael Stimson Sep 06 '17 at 21:15
  • 2
    I mean that it also happens that programmer is using fast language but writes ineffective code. Then the primary solution is not to change the language. This difference 10 sec vs. 9 minutes can't be explained by using something else vs. Python. – user30184 Sep 06 '17 at 21:23
  • @user30184 I think I'd take a stand for myself and state that my code was not ineffective! Unfortunately I think it is just the overhead of using Python. A single arcpy.Clip_analysis() for whatever it is wrapping vs 14 lines of Python using csv, shapefile and shapely. I was just curious if arcpy was processing the points asynchronously. – pstatix Sep 07 '17 at 00:43
  • Please take seriously what is said about spatial index. The fast one probably applies it under the hood and someone else had taken care about programming that routine. – user30184 Sep 07 '17 at 04:38
  • Very little about ArcGIS is asynchronous, the base objects are not thread safe.. only a very few tools split to worker threads and all of the tools that honor the Parallel Processing Factor are new. In order to make ArcGIS 64bit and multiprocessing the base objects (some of which are probably still in FORTRAN and possibly even assembly) would need to be rewritten completely. There aren't many developers that even remember what FORTRAN is and even fewer who would touch assembly. Did you try OGR2OGR? I found it even faster that arcpy.Clip cutting out 10cm contour tiles of 1km x 1km. – Michael Stimson Sep 07 '17 at 05:36
  • 2
    The Python solution in the link takes one multipolygon at a time and tests each point "is it within me", thus 27000 operations for each multipolygon. The solution that utilises spatial index would take a multipolygon, select only points which fall inside its bounding box from the index which is very fast, and runs the "within" test only for those candidates. Much less to test, much faster, in any language. – user30184 Sep 07 '17 at 05:46
  • @user30184 If when spatially indexed (say through a K-d tree), a polygon bounding box is its maximum and minimum extent. Say we have a country with many islands, thus many multipolygons. I only want to clip by the islands. How would one spatially index that against the points (which I can do, its the mapping back to the bounds of the shapes that I am stuck) – pstatix Sep 07 '17 at 11:38
  • What do you mean with islands as many multipolygons? A whole archipelago can be modeled as a single multipolygon but many islands are usually saved as many polygons. Or each island is encoded as a multipolygon which only has one polygon member. If you have an archipelago then you are right that bounding box based spatial index does not help because BBOX may cover all data. – user30184 Sep 07 '17 at 12:04
  • @user30184 That is correct; apologies for the incorrect terminology. I often have a single layer (multipolygon) comprised of n-level polygons. Use of a BBOX would cause problems for points that don't fall within the n-level polygons but say the water instead. So indexing the bounds of the n-level polygons and determining if a point falls within is the final problem. – pstatix Sep 07 '17 at 12:10
  • Well, huge multipolygons are not so easy to handle. I would consider to explode the layer into simple polygons first and find all the points which are within any n-level polygon. N-level polygons are probably quite small and simple spatial indexes like R-Tree should help a lot. This document does not suit for your use case but it may interest you still with a trick that makes analysis 500 times faster http://www.gaia-gis.it/spatialite-3.0.0-BETA1/WorldBorders.pdf. – user30184 Sep 07 '17 at 12:22
  • I have an application that that has to do a bunch of intersects over a large number of features, and switching from arcpy (simple grid index) to python rtree cut the processing time in half. – Marc Pfister Sep 07 '17 at 13:02
  • The given example uses Shapely which uses the C++ GEOS library for the geometry calculations. – Marc Pfister Sep 07 '17 at 16:06
  • @MarcPfister Where can I see an example implementation of an R-Tree in python? – pstatix Sep 07 '17 at 16:47
  • http://toblerity.org/rtree/tutorial.html – Marc Pfister Sep 07 '17 at 16:51
  • @MarcPfister Really meant without having to increase the package count. Sure I could import a module, but I am trying to avoid having to use excessive modules (its a project constraint really). Was looking for the algorithm (doesn't have to be Python but a plus!) – pstatix Sep 07 '17 at 17:33
  • Take a look at https://github.com/karolur/PyQuadTree, it's a simpler type of index but would be easy to add to a project. – Marc Pfister Sep 07 '17 at 17:45

2 Answers2

3

My understanding is that the polygon intersecting algorithms used by the Clip tool are written at a much lower level than Python.

The Clip tool is then made available within the ArcPy site package via a Python wrapper.

PolyGeo
  • 65,136
  • 29
  • 109
  • 338
2

Lack of a spatial index. You're evaluating every point for intersection whether or not it's likely to be a candidate.

If you want to do it in Python, you should use something like Rtree to quickly weed out which points might intersect your clipping polygon. For a pure Python spatial index you could try PyQuadTree.

Marc Pfister
  • 4,097
  • 17
  • 11