
I have Python code which is designed to take point shapefiles through the following workflow:

  1. Merge points
  2. Integrate points, such that any points within 1 m of each other become one point
  3. Create Feature layer, where points with z < 10 are selected
  4. Buffer Points
  5. Polygon to raster 1m resolution
  6. Reclassify, where 1 - 9 = 1; NoData = 0

Each shapefile has approximately 250,000 to 350,000 points covering an area of roughly 5 x 7 km. The input points represent tree locations, and each point (i.e., tree) has an associated "z" value representing crown radius, which is used in the buffer step. My intent is to use the final binary output in a separate process to produce a raster describing canopy cover.

I ran a test with four shapefiles; it produced a 700 MB raster and took 35 minutes (i5 processor and 8 GB RAM). Given that I will need to run this process on 3,500 shapefiles, I would appreciate any advice to streamline the process (see attached code). Generally speaking, what is the best way to deal with geoprocessing big data? More specifically, are there any tweaks to the code or workflow that could help increase efficiency?

Edit:

Time (% of total) for geoprocessing tasks:

  • Merge = 7.6%
  • Integrate = 7.1%
  • Feature to Layer = ~0%
  • Buffer = 8.8%
  • Poly to Raster = 74.8%
  • Reclassify = 1.6%
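
Per-step timings like these can be captured by wrapping each geoprocessing call with a simple timer, along these lines (a minimal sketch; the commented-out call is only an illustration of the pattern):

import time
import arcpy

timings = {}

def timed(name, func, *args):
    """Run a geoprocessing function and record its elapsed wall-clock time."""
    start = time.time()
    result = func(*args)
    timings[name] = time.time() - start
    return result

# Example of wrapping one step (arguments mirror the script below):
# timed("Integrate", arcpy.Integrate_management, "C:\\gdrive\\temp\\temp.shp #", "1 Meters")

total = sum(timings.values())
for name, seconds in timings.items():
    print("%s = %.1f%% (%.1f s)" % (name, 100.0 * seconds / total, seconds))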


# Import arcpy module
import arcpy

# Check out any necessary licenses
arcpy.CheckOutExtension("spatial")

# Script arguments
temp4 = arcpy.GetParameterAsText(0)
if temp4 == '#' or not temp4:
    temp4 = "C:\\gdrive\\temp\\temp4" # provide a default value if unspecified

Reclassification = arcpy.GetParameterAsText(1)
if Reclassification == '#' or not Reclassification:
    Reclassification = "1 9 1;NODATA 0" # provide a default value if unspecified

Multiple_Value = arcpy.GetParameterAsText(2)
if Multiple_Value == '#' or not Multiple_Value:
    Multiple_Value = "C:\\t1.shp;C:\\t2.shp;C:\\t3.shp;C:\\t4.shp" # provide a default value if unspecified

# Local variables:
temp_shp = Multiple_Value
Output_Features = temp_shp
temp2_Layer = Output_Features
temp_Buffer = temp2_Layer
temp3 = temp_Buffer

# Process: Merge
arcpy.Merge_management(Multiple_Value, temp_shp, "x \"x\" true true false 19 Double 0 0 ,First,#,C:\\#########omitted to save space

# Process: Integrate
arcpy.Integrate_management("C:\\gdrive\\temp\\temp.shp #", "1 Meters")

# Process: Make Feature Layer
arcpy.MakeFeatureLayer_management(temp_shp, temp2_Layer, "z <10", "", "x x VISIBLE NONE;y y VISIBLE NONE;z z VISIBLE NONE;Buffer Buffer VISIBLE NONE")

# Process: Buffer
arcpy.Buffer_analysis(temp2_Layer, temp_Buffer, "z", "FULL", "ROUND", "NONE", "")

# Process: Polygon to Raster
arcpy.PolygonToRaster_conversion(temp_Buffer, "BUFF_DIST", temp3, "CELL_CENTER", "NONE", "1")

# Process: Reclassify
arcpy.gp.Reclassify_sa(temp3, "Value", Reclassification, temp4, "DATA")
– Aaron
  • It may be worth putting in some performance timing code to establish whether the bulk of the time is going into one or a few steps - so that those can be focussed on to try and find performance gains – PolyGeo Aug 05 '12 at 00:23
  • I don't think you have many options to improve the performance if you keep using ArcPy. Maybe you can look at other tools to do this? Tools like FME or maybe PostGIS? – tmske Aug 06 '12 at 11:56
  • From a hardware perspective, a solid-state drive would probably provide a better performance increase than any other option, assuming the pagefile is never being hit with only 8 GB of RAM. – MLowry Aug 06 '12 at 15:38
  • @PolyGeo: Good idea. I updated the post with the timing results. – Aaron Aug 06 '12 at 17:15
  • It's not clear what pixel type is being used, but if it is "Byte" (which it should be), then the raw data storage would be 5000x7000=35Mb (~33.4MB) per raster, which actually isn't that big. However, the next dimension of 3500 (time dimension?) increases the total raw size to ~114GB. – Mike T Aug 06 '12 at 19:12
  • @Mike: Thanks for catching that. The pixel depth/type is 4-bit unsigned integer. In fact, the original output was a GRID, which produced a 600+ MB raster. I recently changed the output to .tif, which produced a more reasonable ~70 MB output, corresponding to your calculations. – Aaron Aug 06 '12 at 19:45
  • Is "z" here the elevation of the point, or the count of the points that were integrated? I am thinking the goal should be to eliminate as much data as possible for later steps, and run the expensive steps as few times as possible. – blord-castillo Aug 07 '12 at 12:05
  • @blord-castillo: In this case, "z" is a diameter measurement. – Aaron Aug 07 '12 at 12:51
  • How is z being calculated? – blord-castillo Aug 07 '12 at 13:03
  • @blord-castillo: z is calculated from a separate algorithm prior to running this model. – Aaron Aug 07 '12 at 13:36
  • Although I cannot tell from this description what the algorithm is doing (or intended to do), in most cases point buffers followed by rasterization should be replaced by rasterization of the points followed by a focal statistic (usually a mean or sum). The result will be the same but, by avoiding the lengthy buffering and poly rasterization steps, it will be obtained much faster. I (strongly) suspect substantial additional speedups could be obtained but cannot provide specific advice due to the vagueness of the description of the procedure. – whuber Aug 07 '12 at 16:05
  • The buffers around the points are variable size (based on the z value of the point). I think that to still do focal statistics, you would have to split the points up by z value and do raster and focal stats on each set (using z as the radius of a circular neighborhood with maximum as the stat). Then run Cell Statistics across all 9 rasters with the max stat in order to combine the results together. (Which is still probably a lot faster than buffer-and-rasterize with a big data set.) – blord-castillo Aug 07 '12 at 17:50
  • Thanks @blord, for the clarification. If there's no chance of buffers overlapping, there's a clever hack: compute a kernel density based on an appropriate function of z, using a "smooth" kernel, and then select values exceeding an appropriate threshold. Conceivably something like this could be used to effect the initial "merge" of the points, too, so that everything might be accomplished in two quick raster calculations. – whuber Aug 07 '12 at 18:34
  • I assume the Integrate is there because there is considerable overlap of the points' radii. With a max radius of 9 (m?) on the buffers and an integrate distance of 1 m, I would think there has to be overlap on the buffers too, even after the Integrate. I think we need more info on what this is trying to accomplish, and from what source data. – blord-castillo Aug 07 '12 at 18:50
  • @blord-castillo & whuber: The input data consist of a point shapefile where each point has an x-y coordinate and a z-value representing a radius. As you both mentioned, the z-value associated with each point is a variable radius from 0 to 9, and buffers do overlap in areas. I know that the few points with z-values over 9 are errors, so they are omitted in the MakeFeatureLayer step. The algorithm that assigns an x-y coordinate and z-value to the features of interest sometimes creates duplicate points within 1 m of each other on overlapping sections, which is why the Integrate command is used. – Aaron Aug 07 '12 at 19:59
  • Have you tried moving your shapefiles to feature classes in a geodatabase? How about saving intermediate data to in-memory workspaces? http://resources.arcgis.com/en/help/main/10.1/index.html#//002w0000005s000000 – TurboGus Aug 07 '12 at 20:12
  • It looks as though you're rasterising to a resolution of 1 m. Is that aligned/snapped to the underlying coordinate system grid? If so, you would shave off some time by working with the x, y and z values of the points as attributes, rounding the x and y values to the centres of each grid cell, and then summarising to remove duplicates and extract the maximum z value from each stack. You could even convert directly into raster cell offsets, create the raster as a NumPy structure, and then create the buffers using raster kernel equivalents without touching Spatial Analyst functions. – Andy Harfoot Aug 08 '12 at 08:38
  • Aaron, I still feel it would be extremely beneficial to have an English description of what your data are, what they mean, what you are trying to accomplish, and how you would interpret the results. You are hamstringing yourself by obliging readers to (a) know all the details of the code in your model and (b) take, for a point of departure, a solution known to be inefficient, rather than using your ultimate goals to direct the thinking. That's why you're getting so many different comments and suggestions. – whuber Aug 08 '12 at 15:12
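
To make the rasterize-then-focal-statistics idea from the comments above concrete, here is a rough, untested sketch. It assumes the merged and integrated points live in a placeholder merged_points.shp, that crown radii can be binned to whole metres from 1 to 9, and that a 1 m cell size is acceptable; the output path is also a placeholder:

import arcpy
from arcpy.sa import *

arcpy.CheckOutExtension("spatial")
arcpy.env.workspace = "C:/gdrive/temp"                          # placeholder workspace
arcpy.env.cellSize = 1                                          # 1 m cells, as in the original model
arcpy.env.extent = arcpy.Describe("merged_points.shp").extent   # keep every raster on the full extent

grown = []
for radius in range(1, 10):                                     # crown radii binned to 1..9 m
    lyr = "pts_r%d" % radius
    # points whose crown radius falls in this 1 m class (z < 10 already drops the bad points)
    arcpy.MakeFeatureLayer_management("merged_points.shp", lyr,
                                      '"z" > %d AND "z" <= %d' % (radius - 1, radius))
    if int(arcpy.GetCount_management(lyr).getOutput(0)) == 0:
        continue
    point_ras = "pts_r%d.tif" % radius
    arcpy.PointToRaster_conversion(lyr, "z", point_ras, "MOST_FREQUENT", "", 1)
    # grow each point out to its crown radius with a circular focal maximum
    grown.append(FocalStatistics(point_ras, NbrCircle(radius, "MAP"), "MAXIMUM", "DATA"))

# combine the per-radius rasters and collapse to a 1/0 canopy mask
canopy = Con(IsNull(CellStatistics(grown, "MAXIMUM", "DATA")), 0, 1)
canopy.save("canopy_cover.tif")

The result should match the buffer-and-rasterize output to within a cell along crown edges, while skipping the expensive Buffer and Polygon to Raster steps entirely.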

2 Answers


The first thing I would do is monitor your system's resource utilization using something like Resource Monitor in Windows 7 or perfmon in Vista/XP to get a feel for whether you are CPU-, memory- or IO-bound.

If you are memory- or IO-bound, there is likely very little you can do other than upgrade hardware, reduce the problem size, or change the approach entirely.

If you determine that you are CPU-bound, I would experiment with the multiprocessing module, or one of the many other Python-based parallel processing packages available, to see if you can use more CPU cores to speed up your operations.

The trick to multiprocessing and parallelism in general is finding a good partitioning scheme that:

  1. Allows you to split up the inputs into smaller working sets, then recombine the results in a way that makes sense,
  2. Adds the least amount of overhead (some is unavoidable compared to serial processing), and
  3. Allows you to adjust the size of the working set to best utilize the system's resources for optimal performance.
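
For illustration only (this is not the script linked below, just a minimal sketch): if each shapefile, or batch of shapefiles, can be processed independently, the basic multiprocessing.Pool pattern looks roughly like this, with process_one standing in as a placeholder for the per-file geoprocessing chain:

import multiprocessing
import os
import arcpy

def process_one(shapefile):
    """Placeholder worker: run the merge/integrate/buffer/raster/reclassify
    chain for a single shapefile and return the path of its output raster."""
    arcpy.CheckOutExtension("spatial")
    out_raster = os.path.join(r"C:\gdrive\temp\out",
                              os.path.basename(shapefile).replace(".shp", ".tif"))
    # ... the geoprocessing calls for this shapefile would go here ...
    return out_raster

if __name__ == "__main__":
    # The __main__ guard is required on Windows so worker processes can be spawned safely
    shapefiles = [r"C:\t1.shp", r"C:\t2.shp", r"C:\t3.shp", r"C:\t4.shp"]
    pool = multiprocessing.Pool(processes=4)   # tune to the number of CPU cores
    results = pool.map(process_one, shapefiles)
    pool.close()
    pool.join()
    print(results)

The partitioning decisions above then come down to how many shapefiles each worker handles per job and how the per-file outputs are recombined afterwards.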

You can use the script I created in this answer as a starting point: Porting Avenue code for Producing Building Shadows to ArcPy/Python for ArcGIS Desktop?

See also this ESRI Geoprocessing blog post on the subject: Python Multiprocessing – Approaches and Considerations

I think your case is going to be even more challenging due to the "black box" nature of the tools you are using, compared with the more fine-grained geometry arrays I was working with. Perhaps working with NumPy arrays will come in handy.
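
For instance, the final reclassify step could be done as a cell-wise NumPy operation on the rasterized buffers; a rough sketch (paths are placeholders, and it assumes the raster fits comfortably in memory):

import arcpy
import numpy

# Read the rasterized buffers into a NumPy array (placeholder path)
raster = arcpy.Raster(r"C:\gdrive\temp\temp3.tif")
array = arcpy.RasterToNumPyArray(raster, nodata_to_value=0)

# The reclassify step as a cell-wise operation: anything from 1 to 9 becomes 1
binary = ((array >= 1) & (array <= 9)).astype(numpy.uint8)

# Write the result back out, preserving the origin and 1 m cell size
lower_left = arcpy.Point(raster.extent.XMin, raster.extent.YMin)
out = arcpy.NumPyArrayToRaster(binary, lower_left,
                               raster.meanCellWidth, raster.meanCellHeight)
out.save(r"C:\gdrive\temp\temp4.tif")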

I also came across some interesting reading material if you want to look beyond arcpy.

– blah238

Some algorithm changes that should help you.

Execute your selection first, before the merge and integrate. This will significantly cut down the amount of data going into the later functions, which are the most expensive.

Merge and Integrate are both memory-expensive, so you want to keep eliminating features as you bring in feature classes, and try to do your merges in a binary tree to keep the size of the merges and integrates down. For example, with four shapefiles: merge two shapefiles and integrate; merge two more shapefiles and integrate; then merge the two resulting feature classes and integrate.

Your job queue starts as a queue of shapefile references, and you also have a result queue to place results into. The run() method for your parallel processing worker does the following:

  1. Take two items off the queue. If no item is taken (the queue is empty), terminate the worker. If only one item is taken, put that item straight into the result queue.
  2. If two items are taken, then for each item: if it is a shapefile, select for z < 10 and create an in_memory feature class; otherwise it is already an in_memory feature class, so skip the selection step.
  3. Merge the two in_memory feature classes to create a new in_memory feature class, and delete the original two.
  4. Run Integrate on the new feature class and place the result into the result queue.

Then run an outer while loop. The loop starts with the shapefile queue and tests for a length greater than 1. It runs the queue through the workers, and if the result queue has a length greater than 1, the while loop executes another parallel processing run through the workers, until the result queue holds a single in_memory feature class.

For example, if you start with 3500 shapefiles, your first queue will have 3500 jobs, the second 1750 jobs, then 875, 438, 219, 110, 55, 28, 14, 7, 4, 2, 1. Your big bottleneck will be memory. If you do not have enough memory (and if that is the case, you will run out while creating the first result queue), then modify your algorithm to merge more than two feature classes at once and then integrate, which will cut down the size of your first result queue in exchange for longer processing time. Optionally, you could write output files and skip using in_memory feature classes. This will slow you down considerably, but would get past the memory bottleneck.
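
A rough serial sketch of that pairwise merge-and-integrate reduction is below (it assumes the shapefiles have unique base names and a z field; in a true multiprocessing setup each worker would need its own scratch workspace for intermediates, because the in_memory workspace is private to each process):

import os
import arcpy

def load(item):
    """If the item is still a shapefile, select z < 10 into an in_memory
    feature class; otherwise it is already an in_memory feature class."""
    if item.lower().endswith(".shp"):
        name = os.path.splitext(os.path.basename(item))[0]
        lyr, fc = name + "_lyr", "in_memory/" + name + "_sel"
        arcpy.MakeFeatureLayer_management(item, lyr, '"z" < 10')
        arcpy.CopyFeatures_management(lyr, fc)
        arcpy.Delete_management(lyr)
        return fc
    return item

def merge_pair(item_a, item_b, out_fc):
    """The core worker step: merge two inputs, drop the originals, integrate."""
    fc_a, fc_b = load(item_a), load(item_b)
    arcpy.Merge_management([fc_a, fc_b], out_fc)
    arcpy.Delete_management(fc_a)
    arcpy.Delete_management(fc_b)
    arcpy.Integrate_management(out_fc, "1 Meters")
    return out_fc

# Pairwise reduction: repeatedly merge pairs until a single feature class remains.
queue = [r"C:\t1.shp", r"C:\t2.shp", r"C:\t3.shp", r"C:\t4.shp"]  # placeholder inputs
round_no = 0
while len(queue) > 1:
    round_no += 1
    next_round = []
    for i in range(0, len(queue) - 1, 2):
        out_fc = "in_memory/merge_%d_%d" % (round_no, i)
        next_round.append(merge_pair(queue[i], queue[i + 1], out_fc))
    if len(queue) % 2:                    # an odd item passes straight through
        next_round.append(queue[-1])
    queue = next_round

final_fc = queue[0]  # buffer, polygon-to-raster and reclassify then run once on this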

Only after you have performed merge and integrate on all of the shapefiles, ending with one single feature class, do you then perform the buffer, poly to raster, and reclassify. That way those three operations are performed only once and you keep your geometry simple.

– blord-castillo