
I have seen some difficult ways to do parallel processing, but I wonder if it is possible to simply execute multiple processes of the same ArcPy script at the same time.

My script makes some changes to the default geodatabase, so I thought of making a copy of the geodatabase for each process.

I have updated the script to copy the resources shared between the processes, so it copies the geodatabase and the MXDs related to it.
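For illustration, this is roughly the shape of that copying step (a minimal sketch, not my actual script; the template path and the worker_id parameter are made up):

import arcpy

def make_private_workspace(worker_id):
    # hypothetical helper: give each process its own copy of the geodatabase
    src = r'C:\data\template.gdb'               # placeholder for the shared geodatabase
    dst = r'C:\data\worker_%d.gdb' % worker_id  # one private copy per process
    arcpy.Copy_management(src, dst)             # copy the whole geodatabase
    arcpy.env.workspace = dst                   # point this process at its copy
    return dst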

To test the parallelization, I used this script:

import multiprocessing

def test_func(value):
    # placeholder worker; the real script runs the geoprocessing here
    print(value)

if __name__ == '__main__':  # the guard is required when using multiprocessing on Windows
    pool = multiprocessing.Pool(2)
    pool.map(test_func, [1, 2], 1)  # chunksize of 1
    pool.close()
    pool.join()

I noticed, when monitoring RAM and CPU usage, that every process consumes about 200 MB. So, if I have 6 GB of RAM, I think I can exploit 5 GB of it (leaving 1 GB for the system) by enlarging the pool size to:

5000 / 200 = 25

So, to exploit the whole power of the machine, I think I should use a pool size of 25.

I need to know whether this is the best approach, and how I could measure the efficiency of this parallelization.
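For instance, I imagine something like the following (a rough sketch reusing the test_func stub above; the job list and pool sizes are made up): run the same workload at several pool sizes and compare the wall-clock speedup against a single worker.

import multiprocessing
import time

def time_pool(workers, jobs):
    # time one full run of the workload with a given pool size
    start = time.time()
    pool = multiprocessing.Pool(workers)
    pool.map(test_func, jobs)
    pool.close()
    pool.join()
    return time.time() - start

if __name__ == '__main__':
    jobs = range(8)  # hypothetical workload
    base = time_pool(1, jobs)
    for workers in (2, 4, multiprocessing.cpu_count()):
        elapsed = time_pool(workers, jobs)
        print('%d workers: %.1f s, speedup %.2fx' % (workers, elapsed, base / elapsed))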

This is an example of the code that I'm trying to parallelize. The whole script contains 1500 lines of code, almost all like this:

def dora_layer_goned():
    arcpy.Select_analysis("layer_goned", "layer_goned22")
    arcpy.MakeFeatureLayer_management("layer_goned", "layer_goned_lyr")
    arcpy.SelectLayerByLocation_management("layer_goned_lyr", "WITHIN", "current_parcel", "", "NEW_SELECTION")
    arcpy.SelectLayerByAttribute_management("layer_goned_lyr", "SWITCH_SELECTION")
    arcpy.Select_analysis("layer_goned_lyr", "layer_goned_2_dora2")
    arcpy.Clip_analysis("layer_goned_lyr", "current_parcel_5m_2", "layer_goned_2_dora")
    arcpy.SelectLayerByAttribute_management("layer_goned_lyr", "CLEAR_SELECTION")
    arcpy.FeatureToPoint_management("layer_goned_2_dora", "layer_goned_2_dora_point", "CENTROID")
    arcpy.MakeFeatureLayer_management("layer_goned_2_dora2", "layer_goned_2_dora_lyr")
    arcpy.SelectLayerByLocation_management("layer_goned_2_dora_lyr", "INTERSECT", "layer_goned_2_dora_point", "", "NEW_SELECTION")
    arcpy.DeleteFeatures_management("layer_goned_2_dora_lyr")
    arcpy.FeatureVerticesToPoints_management("current_parcel", "current_parcel__point", "ALL")
    arcpy.FeatureVerticesToPoints_management("carre_line", "carre_line__point", "ALL")
    arcpy.CalculateField_management("current_parcel__point", "id", "!objectid!", "PYTHON_9.3")
    arcpy.SpatialJoin_analysis("carre_line__point", "current_parcel__point", "carre_line__point_sj", "JOIN_ONE_TO_ONE", "KEEP_COMMON", "", "CLOSEST")
    arcpy.Append_management("current_parcel__point", "carre_line__point_sj", "NO_TEST")
    arcpy.PointsToLine_management("carre_line__point_sj", "carre_line__point_sj_line", "id")
    arcpy.Buffer_analysis("carre_line__point_sj_line", "carre_line__point_sj_line_buf", 0.2)
    arcpy.Erase_analysis("layer_goned_2_dora2", "carre_line__point_sj_line_buf", "layer_goned_2_dora_erz")
    arcpy.MultipartToSinglepart_management("layer_goned_2_dora_erz", "layer_goned_2_dora_erz_mono")
    arcpy.MakeFeatureLayer_management("layer_goned_2_dora_erz_mono", "layer_goned_2_dora_erz_lyr")
    arcpy.SelectLayerByLocation_management("layer_goned_2_dora_erz_lyr", "SHARE_A_LINE_SEGMENT_WITH", "current_parcel", "", "NEW_SELECTION")
    arcpy.SelectLayerByLocation_management("layer_goned_lyr", "CONTAINS", "layer_goned_2_dora_erz_lyr", "", "NEW_SELECTION")
    arcpy.DeleteFeatures_management("layer_goned_lyr")
    arcpy.Append_management("layer_goned_2_dora_erz_lyr", "layer_goned", "NO_TEST")
    arcpy.SelectLayerByAttribute_management("layer_goned_2_dora_erz_lyr", "CLEAR_SELECTION")
geogeek
  • What exactly are you trying to parallelize? Do you have any idea whether you are CPU or IO-bound? – blah238 Mar 21 '13 at 01:14
  • I'm trying to parallelize using the CPU, because I don't have a big-data problem; but processing each parcel takes a huge amount of time, so I thought I would parallelize by duplicating the resources used for the processing. – geogeek Mar 21 '13 at 08:58
  • This isn't a duplicate of the suggested question. This one deals with arcpy, and the other one has one barely useful answer concerned with ArcObjects. – Devdatta Tengshe Mar 21 '13 at 16:27
  • I'm trying to get you to describe the actual process you're trying to speed up. Not everything can be easily parallelized. – blah238 Mar 21 '13 at 17:37
  • I have made an edit with more details about the problem. – geogeek Mar 21 '13 at 19:36
  • You still have not explained the actual process (the contents of your test_func) that you are trying to speed up. Also, your hypothesis of using 25 as the pool size is flawed because you seem to be under the false impression that using all of your system's memory will automatically help with performance. This is not necessarily true, and indeed, most likely would result in decreased performance as the pagefile starts to be hit (also keep in mind each 32-bit process is limited to 2GB by default). You should be seeking to minimize overall processing time, not maximizing memory usage. – blah238 Mar 21 '13 at 21:58
  • @Devdatta, perhaps that was not the best duplicate question choice, but because this type of question has been asked before on this site many times I figured it was good enough. – blah238 Mar 21 '13 at 22:53
  • @blah238 So sorry, I cannot include the function; it contains 1500 lines of code, and all the functions in this code are vector geoprocessing functions. But this function takes 10 minutes to execute. – geogeek Mar 21 '13 at 23:16
  • I would like to know whether the 2 GB limit is for each 32-bit process or for all child processes of a 32-bit process. And if I move to ArcGIS 10.1 SP1, can I work with 64-bit geoprocessing in this case? – geogeek Mar 21 '13 at 23:20
  • Refer to: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx – blah238 Mar 21 '13 at 23:52
  • I think the point I'm trying to make is that if you are CPU-bound, you should only use as many processes as you have CPUs (subtract one for background activity). Are you CPU or IO-bound? This is why I keep asking what your process is. If it's mostly just shifting data around on disk, then there is likely to be no benefit to be gained from parallelization. – blah238 Mar 22 '13 at 00:03
  • I have tried parallelizing with 2 processes, but instead of executing in 10 minutes like one process normally does, they took 15 minutes. Can I tell from this result whether I should invest in parallelization? – geogeek Mar 22 '13 at 00:14
  • It's hard to say without seeing your code and performance/resource counters, but it sounds like you are IO-bound. I would look at improving the efficiency of your code to reduce the amount of time spent reading and writing to disk before trying to use multiprocessing. – blah238 Mar 22 '13 at 00:45
  • I have made an update with a code example from the script that I want to parallelize. – geogeek Mar 22 '13 at 11:58
  • Yeah, as I suspected, that does not look like something you would get much benefit from multiprocessing. It's just a bunch of chunky GP tool calls, most of which create a new output on disk so I'm thinking you're definitely IO-bound. If you want help/suggestions on optimizing your process then start a new question with the code you want help with and what it's supposed to be doing. – blah238 Mar 22 '13 at 16:32
  • Yes, I need help; here is the new question: http://gis.stackexchange.com/questions/55234/optimize-io-bound-arcpy-script – geogeek Mar 22 '13 at 17:25

2 Answers


See this blog post; it should cover it:

http://blogs.esri.com/esri/arcgis/2012/09/26/distributed-processing-with-arcgis-part-1/
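If I remember the gist of it correctly (this is a hand-wavy sketch with made-up names, not the blog's actual code), the pattern is to split the features into independent chunks and hand one chunk to each worker process:

import multiprocessing
import arcpy

def process_chunk(oid_range):
    # hypothetical worker: select one range of OBJECTIDs and run the GP steps on it
    first, last = oid_range
    where = 'OBJECTID >= %d AND OBJECTID <= %d' % (first, last)
    layer = 'parcels_%d_%d' % (first, last)
    arcpy.MakeFeatureLayer_management(r'C:\data\parcels.gdb\parcels', layer, where)
    # ... run the rest of the geoprocessing against the layer ...

if __name__ == '__main__':
    chunks = [(1, 1000), (1001, 2000), (2001, 3000)]  # made-up OID ranges
    pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)  # leave one core free
    pool.map(process_chunk, chunks)
    pool.close()
    pool.join()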

gotchula

Just use the following function:

import multiprocessing

def run_MultiPros(function, variables):
    """Execute a function on multiple processors.

    INPUTS:
    function (required): the function to be executed.
    variables (required): list of variables, one passed to each job.

    Description: runs the given function across multiple processes;
    the total number of jobs equals the number of variables.
    """
    pool = multiprocessing.Pool()
    pool.map(function, variables)
    pool.close()
    pool.join()
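For example, with a hypothetical worker (the function and the feature class names here are placeholders, not part of the original answer), each variable in the list becomes one job:

import arcpy

def buffer_parcels(fc):
    # placeholder worker: each process buffers one feature class
    arcpy.Buffer_analysis(fc, fc + '_buf', '10 Meters')

if __name__ == '__main__':
    run_MultiPros(buffer_parcels, ['parcels_a', 'parcels_b', 'parcels_c'])

Note that the worker function must be defined at module level, and the call wrapped in the __main__ guard, so the child processes can import it on Windows.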
iRfAn