3

I have a feature layer that contains about 460,000 records, and it currently takes about 20 minutes to read that table using arcpy.da.TableToNumPyArray(). Is there a more efficient way to read the rows so that I can then manipulate these data? It seems like there should be a more "C-like" way to access the rows. Here is the function in its entirety, though I'm focused on the line near the bottom where I call arcpy.da.TableToNumPyArray() to read out the data:

def import_surface(surface, grid, nrow, ncol, row_col_fields, field_to_convert):
    """
    References raster surface to model grid.
    Returns an array of size nrow, ncol.
    """
    out_table = r'in_memory\{}'.format(surface)
    grid_oid_fieldname = arcpy.Describe(grid).OIDFieldName

    # Compute the mean surface value for each model cell and output a table (~20 seconds)
    arcpy.sa.ZonalStatisticsAsTable(grid, grid_oid_fieldname, surface, out_table, 'DATA', 'MEAN')

    # Build some layers and views
    grid_lyr = r'in_memory\grid_lyr'
    table_vwe = r'in_memory\table_vwe'
    arcpy.MakeFeatureLayer_management(grid, grid_lyr)
    arcpy.MakeTableView_management(out_table, table_vwe)

    grid_lyr_oid_fieldname = arcpy.Describe(grid_lyr).OIDFieldName
    # table_vwe_oid_fieldname = arcpy.Describe(table_vwe).OIDFieldName

    # Join the output zonal stats table with the grid to assign row/col to each value.
    arcpy.AddJoin_management(grid_lyr, grid_lyr_oid_fieldname, table_vwe, 'OID_', 'KEEP_ALL')

    # Take the newly joined grid/zonal stats and read out tuples of (row, col, val) (takes ~20 minutes)   
    a = arcpy.da.TableToNumPyArray(grid_lyr, row_col_fields + [field_to_convert], skip_nulls=False)

    # Reshape the 1D structured array output by TableToNumPyArray into a 2D array, sorted by row/col (~0.1 seconds)
    a = np.rec.fromrecords(a.tolist(), names=['row', 'col', 'val'])
    a.sort(order=['row', 'col'])
    b = np.reshape(a.val, (nrow, ncol))

    return b
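
For reference, the search-cursor alternative I mention in the comments below looks roughly like this (a sketch only; on the joined layer it was about as slow as TableToNumPyArray):

import arcpy
import numpy as np

def read_with_cursor(layer, row_col_fields, field_to_convert):
    """Sketch of a row-by-row read using arcpy.da.SearchCursor."""
    fields = list(row_col_fields) + [field_to_convert]
    # SearchCursor yields one tuple per row, in the order the fields are listed
    rows = [row for row in arcpy.da.SearchCursor(layer, fields)]
    return np.rec.fromrecords(rows, names=['row', 'col', 'val'])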
Jason Bellino
  • Do you have a code sample? Are you doing it all in numpy? There are many numpy options for querying, which I'm sure you are aware of, but the source of the time usage needs to be pinned down: is it on the arcpy or the numpy side? – Dan Patterson May 11 '15 at 19:00
  • Why not use the arcpy.da.UpdateCursor? They are pretty fast. – crmackey May 11 '15 at 19:41
  • On the 'C' side you would open a cursor, just like in Python, and then build a list (just as crmackey said), which is probably what arcpy.da.TableToNumPyArray does. If you want to go faster you're (probably) going to have to go non-Esri to read the table, and that depends on how the data are stored. – Michael Stimson May 11 '15 at 21:07
  • @DanPatterson, I've added sample code to the question... I'm really just interested in finding a better way to read the data out of my feature layer than using TableToNumPyArray from the arcpy.da module. I've tried using a search cursor to read row by row and make a list, but that appears to be equally arduous, if not more so. – Jason Bellino May 11 '15 at 21:22
  • @MichaelMiles-Stimson, I was thinking a non-ESRI alternative might be the only way; however, these data are stored in an ESRI geodatabase and I'm using arcpy to manipulate them into a feature layer in memory. I suppose I could write this out to a shapefile and read it using a different set of tools. – Jason Bellino May 11 '15 at 21:24
  • File or personal? A personal geodatabase is just a Microsoft JET (Access) database with Esri tables... you can open and manipulate it in Access (if you know what you're doing and don't break it). – Michael Stimson May 11 '15 at 21:28
  • @MichaelMiles-Stimson, I'm using a file GDB - I don't think arcpy in 10.2 supports accessing personal GDBs with background geoprocessing on. There's gotta be a better solution - I really need this to be an automated process where I don't need to intervene and create intermediary files. – Jason Bellino May 11 '15 at 21:31
  • First thoughts are the use of in_memory and the use of joined data. I would try to make the join permanent, saving to a file on local disk, and if that speeds things up, do the same but saving to in_memory. This should be done before converting to an array. It could be the join, the working location, or a combination of both. – Dan Patterson May 11 '15 at 21:32
  • It's 64bit, not background processing, that prohibits use of JET databases. Dan is definitely on to something there: persist the join to disk/in_memory and then load it. I suspect the problem is that the 'toNumPy' step needs to look up each value in a (possibly non-indexed) table to load it into memory. You could do that faster if you copied your join table to in_memory, indexed it on the join field, and persisted it before loading (a rough sketch of that workflow follows these comments). How big is your join table? – Michael Stimson May 11 '15 at 21:43
  • @DanPatterson, I copied the joined features to disk and TableToNumpyArray (T2NP) ran in 6 seconds - huge improvement there! However, it took 40 minutes (twice as long as T2NP) to copy the features to disk... – Jason Bellino May 12 '15 at 14:03
  • @MichaelMiles-Stimson, the table is about 70 MB with around 460,000 rows so it's not huge. I tried using arcpy.CopyFeatures_management() to an in_memory location, but that copy takes about 40 minutes which doubles the time to process even though the TableToNumpyArray on the resulting table now takes 0.2 seconds. – Jason Bellino May 12 '15 at 20:31
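
A rough sketch of the "persist and index before converting" idea discussed in the comments above, reusing the names from the question's function (out_table, grid, grid_oid_fieldname, row_col_fields, field_to_convert); this is illustrative only, and if AddIndex objects to the in_memory workspace the copy can be written to a scratch file geodatabase instead:

import arcpy

# Copy the zonal stats table out of the join workflow and index the join field
zstat_copy = r'in_memory\zstat_copy'
arcpy.CopyRows_management(out_table, zstat_copy)
arcpy.AddIndex_management(zstat_copy, 'OID_', 'oid_idx')

# Join the indexed copy to the grid, persist the result, then convert
grid_lyr = r'in_memory\grid_lyr'
arcpy.MakeFeatureLayer_management(grid, grid_lyr)
arcpy.AddJoin_management(grid_lyr, grid_oid_fieldname, zstat_copy, 'OID_', 'KEEP_ALL')

joined = r'in_memory\joined_grid'
arcpy.CopyFeatures_management(grid_lyr, joined)
a = arcpy.da.TableToNumPyArray(joined, row_col_fields + [field_to_convert], skip_nulls=False)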

2 Answers

3

I thought to comment first, but then it got too long for a comment.

  • Processing that much data in 20 minutes is a reasonable time, in my opinion. I've also tried to speed up some arcpy operations - look here.
  • If performance is an issue, you could read the file geodatabase with the File Geodatabase API, but I doubt it would be any faster than reading the data with arcpy into a pure Python data structure such as a dict or a list of tuples.
  • It depends a bit on the data structure you have in your gdb table. Looking up keys is faster than looking up values in a Python dict, and list comprehensions can in some situations significantly speed up processing. Search operations are costly in Python, so look for more optimal approaches.

  • I always run 64-bit Python for any large data processing - it runs roughly 10% faster on the datasets I've used.

  • You could play with SQLite, loading the data into a database and running some SQL - it might be faster, but that's hard to verify without testing on your data; a rough sketch of that idea follows.
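
Pulling the joined rows once with a da.SearchCursor and loading them into an in-memory SQLite database might look like this (the layer and field names below are placeholders rather than values from the question):

import sqlite3
import arcpy

# Placeholder layer and field names; substitute the joined layer and its fields
fields = ['grid_row', 'grid_col', 'MEAN']
cursor = arcpy.da.SearchCursor(r'in_memory\grid_lyr', fields)

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE cells (row INTEGER, col INTEGER, val REAL)')
con.executemany('INSERT INTO cells VALUES (?, ?, ?)', cursor)  # cursor yields tuples
ordered = con.execute('SELECT val FROM cells ORDER BY row, col').fetchall()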

Finally, if you don't have one yet, get an SSD - after I switched to an SSD everything just started flying (though now it feels slow again :) because you get used to it).

Alex Tereshenkov
  • Thanks for the suggestions, Alex. The reason I'm surprised at the length of time it takes to read the table (~20 minutes) is because I can run zonal statistics on the dataset in ~20 seconds. I will have to look into getting an SSD for my system! – Jason Bellino May 12 '15 at 13:17
  • I understand... Test with SSD, there should be a notable boost. – Alex Tereshenkov May 12 '15 at 17:21
  • If you don't want to spend any money and have heaps of RAM installed you could create a RAM drive, it's even faster than a SSD. 70MB is something that I could spare even on WinXP 32bit.. if you're at the limit of what you can achieve with code then your only options are to use a different API (which may or may not help) and/or look at your performance bottlenecks, which is usually hard drive access. Note that a noticeable increase can be gained by having all your data on a different physical disc to your system drive, not needing to share bandwidth with system calls. – Michael Stimson May 12 '15 at 21:28
1

The main bottleneck in my previous code was reading from a GDB feature class. Once I converted to a shapefile, my read time dropped to about 10% of the original. I should have remembered that I had the same problem just a few months ago #facepalm. Additionally, in a previous iteration of the current problem, I tried to copy my features to an in_memory location, which did not yield an improvement in processing time. However, I tried it again, perhaps after resetting the IPython Notebook kernel, and the processing time became much more reasonable.

Here is the new code which runs in less than 2 minutes:

def import_surface(surface, grid, nrow, ncol,
                   zstat_field=None,
                   join_field=None,
                   row_col_fields=('grid_row', 'grid_col'),
                   field_aliasname_to_convert='MEAN'):
    """
    References raster surface to model grid and returns an array of size nrow, ncol.
    vars:
        surface: the raster surface to be sampled; raster
        grid: the vector grid upon which the raster will be summarized; feature class
        nrow: number of rows in the grid; int
        ncol: number of columns in the grid; int
        zstat_field: field in the grid that defines the zones; str
        join_field: field in the resulting features to be used for the join; str
        row_col_fields: names of the fields containing row and column numbers; tuple of str
        field_aliasname_to_convert: alias of resulting zstat field to use; str
    """
    if join_field is None:
        join_field = arcpy.Describe(grid).OIDFieldName
    if zstat_field is None:
        zstat_field = join_field

    zstat = r'in_memory\{}'.format(surface)
    arcpy.sa.ZonalStatisticsAsTable(grid, zstat_field, surface, zstat, 'NODATA', 'MEAN')


    # Create feature layers and table views for joining
    grid_lyr = r'in_memory\grid_lyr'
    arcpy.MakeFeatureLayer_management(grid, grid_lyr)

    zstat_vwe = r'in_memory\zstat_vwe'
    arcpy.MakeTableView_management(zstat, zstat_vwe)


    # Join tables
    arcpy.AddJoin_management(grid_lyr, join_field, zstat_vwe, join_field, 'KEEP_ALL')


    # Write the grid features to a new featureclass
    zstat_grid = r'in_memory\zstat_grid'
    arcpy.CopyFeatures_management(grid_lyr, zstat_grid)


    # Ensure we point to the correct zstat field name in case the join truncated it
    for field in arcpy.ListFields(zstat_grid):
        if field.aliasName == field_aliasname_to_convert:
            name = field.name
            break


    # Convert regular-grid polygon shapefile to an array.
    a = arcpy.da.TableToNumPyArray(zstat_grid, row_col_fields + (name, ), skip_nulls=False)


    # Convert to recarray
    a_1 = np.rec.fromrecords(a.tolist(), names=['row', 'col', 'val'])
    a_1.sort(order=['row', 'col'])
    b = np.reshape(a_1.val, (nrow, ncol))
    return b
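
For reference, a call looks something like this (the workspace, dataset names, and grid dimensions below are placeholders):

import arcpy
# numpy must also be imported as np in the module that defines import_surface

arcpy.CheckOutExtension('Spatial')          # ZonalStatisticsAsTable needs Spatial Analyst
arcpy.env.workspace = r'C:\data\model.gdb'  # placeholder workspace

# Placeholder raster, grid, and dimensions (730 x 630 is roughly 460,000 cells)
surf = import_surface('dem_surface', 'model_grid', nrow=730, ncol=630)
print(surf.shape)  # (730, 630)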
Jason Bellino