3

I have a feature layer that contains about 460,000 records, and it currently takes about 20 minutes to read that table using arcpy.da.TableToNumPyArray(). Is there a more efficient way to read the rows so that I can then manipulate these data? It seems like there should be a more "C-like" way to access the rows. Here is the function in its entirety, though I'm focused on the line near the bottom where I call arcpy.da.TableToNumPyArray() to read out the data:

def import_surface(surface, grid, nrow, ncol, row_col_fields, field_to_convert):
    """
    References raster surface to model grid.
    Returns an array of size nrow, ncol.
    """
    out_table = r'in_memory\{}'.format(surface)
    grid_oid_fieldname = arcpy.Describe(grid).OIDFieldName

    # Compute the mean surface value for each model cell and output a table (~20 seconds)
    arcpy.sa.ZonalStatisticsAsTable(grid, grid_oid_fieldname, surface, out_table, 'DATA', 'MEAN')

    # Build some layers and views
    grid_lyr = r'in_memory\grid_lyr'
    table_vwe = r'in_memory\table_vwe'
    arcpy.MakeFeatureLayer_management(grid, grid_lyr)
    arcpy.MakeTableView_management(out_table, table_vwe)

    grid_lyr_oid_fieldname = arcpy.Describe(grid_lyr).OIDFieldName
    # table_vwe_oid_fieldname = arcpy.Describe(table_vwe).OIDFieldName

    # Join the output zonal stats table with the grid to assign row/col to each value.
    arcpy.AddJoin_management(grid_lyr, grid_lyr_oid_fieldname, table_vwe, 'OID_', 'KEEP_ALL')

    # Take the newly joined grid/zonal stats and read out tuples of (row, col, val) (takes ~20 minutes)   
    a = arcpy.da.TableToNumPyArray(grid_lyr, row_col_fields + [field_to_convert], skip_nulls=False)

    # Reshape the 1D structured array output by TableToNumPyArray into a 2D array, sorted by row/col (~0.1 seconds)
    a = np.rec.fromrecords(a.tolist(), names=['row', 'col', 'val'])
    a.sort(order=['row', 'col'])
    b = np.reshape(a.val, (nrow, ncol))

    return b
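
For reference, the search-cursor alternative I mention in the comments below looks roughly like this (a sketch only; on the joined layer it was about as slow as TableToNumPyArray):

import arcpy
import numpy as np

def read_with_cursor(layer, row_col_fields, field_to_convert):
    """Sketch of a row-by-row read using arcpy.da.SearchCursor."""
    fields = list(row_col_fields) + [field_to_convert]
    # SearchCursor yields one tuple per row, in the order the fields are listed
    rows = [row for row in arcpy.da.SearchCursor(layer, fields)]
    return np.rec.fromrecords(rows, names=['row', 'col', 'val'])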
Jason Bellino
  • Do you have a code sample? Are you doing it all in numpy? There are many numpy options for querying, which I'm sure you are aware of, but the source of the time usage needs to be pinned down: is it on the arcpy or the numpy side? – Dan Patterson May 11 '15 at 19:00
  • Why not use the arcpy.da.UpdateCursor? They are pretty fast. – crmackey May 11 '15 at 19:41
  • On the 'C' side you would open a cursor, just like in Python, and then build a list (just as crmackey said), which is probably what arcpy.da.TableToNumPyArray does. If you want to go faster you're (probably) going to have to go non-Esri to read the table, and that depends on how the data are stored. – Michael Stimson May 11 '15 at 21:07
  • @DanPatterson, I've added sample code to the question... I'm really just interested in finding a better way to read the data out of my feature layer than using TableToNumPyArray from the arcpy.da module. I've tried using a search cursor to read row by row and make a list, but that appears to be equally arduous, if not more so. – Jason Bellino May 11 '15 at 21:22
  • @MichaelMiles-Stimson, I was thinking a non-ESRI alternative might be the only way; however, these data are stored in an ESRI geodatabase and I'm using arcpy to manipulate them into a feature layer in memory. I suppose I could write this out to a shapefile and read it using a different set of tools. – Jason Bellino May 11 '15 at 21:24
  • File or personal? A personal geodatabase is just a Microsoft JET (Access) database with Esri tables... you can open and manipulate it in Access (if you know what you're doing and don't break it). – Michael Stimson May 11 '15 at 21:28
  • @MichaelMiles-Stimson, I'm using a file GDB - I don't think arcpy in 10.2 supports accessing personal GDBs with background geoprocessing on. There's gotta be a better solution - I really need this to be an automated process where I don't need to intervene and create intermediary files. – Jason Bellino May 11 '15 at 21:31
  • First thoughts are the use of in_memory and the use of joined data. I would try to make the join permanent, saving to a file on local disk, and if that speeds things up, do the same but saving to in_memory. This should be done before converting to an array. It could be the join, the working location, or a combination of both. – Dan Patterson May 11 '15 at 21:32
  • It's 64bit, not background processing, that prohibits use of JET databases. Dan is definitely on to something there: persist the join to disk/in_memory and then load it. I suspect the problem is that the 'toNumPy' step needs to look up each value in a (possibly non-indexed) table to load it into memory. You could do that faster if you copied your join table to in_memory, indexed it on the join field, and persisted it before loading (a rough sketch of that workflow follows these comments). How big is your join table? – Michael Stimson May 11 '15 at 21:43
  • @DanPatterson, I copied the joined features to disk and TableToNumpyArray (T2NP) ran in 6 seconds - huge improvement there! However, it took 40 minutes (twice as long as T2NP) to copy the features to disk... – Jason Bellino May 12 '15 at 14:03
  • @MichaelMiles-Stimson, the table is about 70 MB with around 460,000 rows so it's not huge. I tried using arcpy.CopyFeatures_management() to an in_memory location, but that copy takes about 40 minutes which doubles the time to process even though the TableToNumpyArray on the resulting table now takes 0.2 seconds. – Jason Bellino May 12 '15 at 20:31
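
A rough sketch of the "persist and index before converting" idea discussed in the comments above, reusing the names from the question's function (out_table, grid, grid_oid_fieldname, row_col_fields, field_to_convert); this is illustrative only, and if AddIndex objects to the in_memory workspace the copy can be written to a scratch file geodatabase instead:

import arcpy

# Copy the zonal stats table out of the join workflow and index the join field
zstat_copy = r'in_memory\zstat_copy'
arcpy.CopyRows_management(out_table, zstat_copy)
arcpy.AddIndex_management(zstat_copy, 'OID_', 'oid_idx')

# Join the indexed copy to the grid, persist the result, then convert
grid_lyr = r'in_memory\grid_lyr'
arcpy.MakeFeatureLayer_management(grid, grid_lyr)
arcpy.AddJoin_management(grid_lyr, grid_oid_fieldname, zstat_copy, 'OID_', 'KEEP_ALL')

joined = r'in_memory\joined_grid'
arcpy.CopyFeatures_management(grid_lyr, joined)
a = arcpy.da.TableToNumPyArray(joined, row_col_fields + [field_to_convert], skip_nulls=False)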

2 Answers

3

I thought to comment first, but then it got too long for a comment.

  • Processing that much data in 20 minutes is a reasonable time, in my opinion. I've also tried to speed up some arcpy operations - look here.
  • If performance is an issue, you could read the file geodatabase with the File Geodatabase API, but I doubt it would be any faster than reading the data with arcpy into a pure Python data structure such as a dict or a list of tuples.
  • It depends a bit on the data structure you have in your gdb table. Looking up keys is faster than looking up values in a Python dict, and list comprehensions can in some situations significantly speed up processing. Search operations are costly in Python, so look for more optimal approaches.

  • I always run 64-bit Python for any large data processing - it runs roughly 10% faster on the datasets I've used.

  • You could play with SQLite, loading the data into a database and running some SQL - it might be faster, but that's hard to verify without testing on your data; a rough sketch of that idea follows.
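
Pulling the joined rows once with a da.SearchCursor and loading them into an in-memory SQLite database might look like this (the layer and field names below are placeholders rather than values from the question):

import sqlite3
import arcpy

# Placeholder layer and field names; substitute the joined layer and its fields
fields = ['grid_row', 'grid_col', 'MEAN']
cursor = arcpy.da.SearchCursor(r'in_memory\grid_lyr', fields)

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE cells (row INTEGER, col INTEGER, val REAL)')
con.executemany('INSERT INTO cells VALUES (?, ?, ?)', cursor)  # cursor yields tuples
ordered = con.execute('SELECT val FROM cells ORDER BY row, col').fetchall()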

Finally, if you don't have one yet, get an SSD - after I switched to an SSD everything just started flying (though now it feels slow again :) because you get used to it).

Alex Tereshenkov
  • Thanks for the suggestions, Alex. The reason I'm surprised at the length of time it takes to read the table (~20 minutes) is because I can run zonal statistics on the dataset in ~20 seconds. I will have to look into getting an SSD for my system! – Jason Bellino May 12 '15 at 13:17
  • I understand... Test with SSD, there should be a notable boost. – Alex Tereshenkov May 12 '15 at 17:21
  • If you don't want to spend any money and have heaps of RAM installed you could create a RAM drive, it's even faster than a SSD. 70MB is something that I could spare even on WinXP 32bit.. if you're at the limit of what you can achieve with code then your only options are to use a different API (which may or may not help) and/or look at your performance bottlenecks, which is usually hard drive access. Note that a noticeable increase can be gained by having all your data on a different physical disc to your system drive, not needing to share bandwidth with system calls. – Michael Stimson May 12 '15 at 21:28
1

The main bottleneck in my previous code was reading from a GDB feature class. Once I converted to a shapefile, my read time dropped to about 10% of the original. I should have remembered that I had the same problem just a few months ago #facepalm. Additionally, in a previous iteration of the current problem, I tried to copy my features to an in_memory location, which did not yield an improvement in processing time. However, I tried it again, perhaps after resetting the IPython Notebook kernel, and the processing time became much more reasonable.

Here is the new code which runs in less than 2 minutes:

def import_surface(surface, grid, nrow, ncol,
                   zstat_field=None,
                   join_field=None,
                   row_col_fields=('grid_row', 'grid_col'),
                   field_aliasname_to_convert='MEAN'):
    """
    References raster surface to model grid and returns an array of size nrow, ncol.
    vars:
        surface: the raster surface to be sampled; raster
        grid: the vector grid upon which the raster will be summarized; feature class
        nrow: number of rows in the grid; int
        ncol: number of columns in the grid; int
        zstat_field: field in the grid that defines the zones; str
        join_field: field in the resulting features to be used for the join; str
        row_col_fields: names of the fields containing row and column numbers; tuple of str
        field_aliasname_to_convert: alias of resulting zstat field to use; str
    """
    if join_field is None:
        join_field = arcpy.Describe(grid).OIDFieldName
    if zstat_field is None:
        zstat_field = join_field

    zstat = r'in_memory\{}'.format(surface)
    arcpy.sa.ZonalStatisticsAsTable(grid, zstat_field, surface, zstat, 'NODATA', 'MEAN')


    # Create feature layers and table views for joining
    grid_lyr = r'in_memory\grid_lyr'
    arcpy.MakeFeatureLayer_management(grid, grid_lyr)

    zstat_vwe = r'in_memory\zstat_vwe'
    arcpy.MakeTableView_management(zstat, zstat_vwe)


    # Join tables
    arcpy.AddJoin_management(grid_lyr, join_field, zstat_vwe, join_field, 'KEEP_ALL')


    # Write the grid features to a new featureclass
    zstat_grid = r'in_memory\zstat_grid'
    arcpy.CopyFeatures_management(grid_lyr, zstat_grid)


    # Ensure we point to the correct zstat field name in case the join truncated it
    for field in arcpy.ListFields(zstat_grid):
        if field.aliasName == field_aliasname_to_convert:
            name = field.name
            break


    # Convert regular-grid polygon shapefile to an array.
    a = arcpy.da.TableToNumPyArray(zstat_grid, row_col_fields + (name, ), skip_nulls=False)


    # Convert to recarray
    a_1 = np.rec.fromrecords(a.tolist(), names=['row', 'col', 'val'])
    a_1.sort(order=['row', 'col'])
    b = np.reshape(a_1.val, (nrow, ncol))
    return b
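
For reference, a call looks something like this (the workspace, dataset names, and grid dimensions below are placeholders):

import arcpy
# numpy must also be imported as np in the module that defines import_surface

arcpy.CheckOutExtension('Spatial')          # ZonalStatisticsAsTable needs Spatial Analyst
arcpy.env.workspace = r'C:\data\model.gdb'  # placeholder workspace

# Placeholder raster, grid, and dimensions (730 x 630 is roughly 460,000 cells)
surf = import_surface('dem_surface', 'model_grid', nrow=730, ncol=630)
print(surf.shape)  # (730, 630)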
Jason Bellino