14

I have a table with 8 columns and ~16.7 million records. I need to run a set of if-else calculations on the columns. I've written a script using arcpy.da.UpdateCursor, but after a few million records it runs out of memory. I was wondering if there is a better way to process these 16.7 million records.

import arcpy, time

# Copy the source table into the working file geodatabase
arcpy.TableToTable_conversion("combine_2013", "D:/mosaic.gdb", "combo_table")

c_table = "D:/mosaic.gdb/combo_table"

fields = ['dev_agg', 'herb_agg', 'forest_agg', 'wat_agg', 'cate_2']

start_time = time.time()
print "Script Started"
with arcpy.da.UpdateCursor(c_table, fields) as cursor:
    for row in cursor:
        # row's 0,1,2,3,4 = dev, herb, forest, water, category
        # classification: water = 1; herb = 2; dev = 3; forest = 4
        if (row[3] >= 0 and row[3] > row[2]):
            row[4] = 1
        elif (row[2] >= 0 and row[2] > row[3]):
            row[4] = 4
        elif (row[1] > 180):
            row[4] = 2
        elif (row[0] > 1):
            row[4] = 3
        cursor.updateRow(row)
end_time = time.time() - start_time
print "Script Complete - " + str(end_time) + " seconds"

UPDATE #1

I ran the same script on a computer with 40 GB of RAM (the original computer had only 12 GB). It completed successfully after ~16 hours. 16 hours feels too long, but I've never worked with such a large dataset, so I don't know what to expect. The only new addition to the script is arcpy.env.parallelProcessingFactor = "100%". I'm trying two suggested methods: (1) processing the records in batches of 1 million and (2) using a SearchCursor and writing the outputs to a CSV. I will report on the progress shortly.

UPDATE #2

The SearchCursor and CSV approach worked brilliantly! I do not have the precise run times (I'll update the post when I'm in the office tomorrow), but the approximate run time is ~5-6 minutes, which is pretty impressive. I was not expecting it. I'm sharing my unpolished code; any comments and improvements are welcome:

import arcpy, csv, time
from arcpy import env

arcpy.env.parallelProcessingFactor = "100%"

arcpy.TableToTable_conversion("D:/mosaic.gdb/combine_2013", "D:/mosaic.gdb", "combo_table")
arcpy.AddField_management("D:/mosaic.gdb/combo_table", "category", "SHORT")

# Table
c_table = "D:/mosaic.gdb/combo_table"
fields = ['wat_agg', 'dev_agg', 'herb_agg', 'forest_agg', 'category', 'OBJECTID']

# CSV that will hold the OBJECTID and the computed category
c_csv = open("D:/combine.csv", "w")
c_writer = csv.writer(c_csv, delimiter=';', lineterminator='\n')
c_writer.writerow(['OID', 'CATEGORY'])

start_time = time.time()
with arcpy.da.SearchCursor(c_table, fields) as cursor:
    for row in cursor:
        # row's 0,1,2,3,4,5 = water, dev, herb, forest, category, oid
        # classification: water = 1; dev = 2; herb = 3; forest = 4
        if (row[0] >= 0 and row[0] > row[3]):
            c_writer.writerow([row[5], 1])
        elif (row[1] > 1):
            c_writer.writerow([row[5], 2])
        elif (row[2] > 180):
            c_writer.writerow([row[5], 3])
        elif (row[3] >= 0 and row[3] > row[0]):
            c_writer.writerow([row[5], 4])

c_csv.close()
end_time = time.time() - start_time
print str(end_time) + " - Seconds"

UPDATE #3 Final update. The total run time for the script is ~199.6 seconds (~3.3 minutes).

user81784
    Are you using 64bit (either Background or Server or Pro)? – KHibma Dec 06 '16 at 16:02
  • Forgot to mention. I'm running 10.4 x64 in Background. – user81784 Dec 06 '16 at 16:28
  • Devil's advocate - have you tried running it in the foreground or from IDLE? Looking at your script, you don't need to have ArcMap open. – Hornbydd Dec 06 '16 at 16:57
  • run it as a standalone script or if you know SQL, upload the shapefile to PostgreSQL and do it there – ziggy Dec 06 '16 at 17:10
  • @Hornbydd - The first run from ArcMap failed and second run was from IDLE. – user81784 Dec 06 '16 at 18:08
  • @ziggy - Sadly, the office doesn't have access to PostgreSQL or SQL. – user81784 Dec 06 '16 at 18:10
  • PostgreSQL is open source and free. I work for the State and easily got it approved. I would look into that – ziggy Dec 06 '16 at 18:22
  • 1
    I understand it is open source, but its approval process takes ~1-2 weeks, and this is time-sensitive, so I don't think it is feasible in this instance. – user81784 Dec 06 '16 at 18:58
  • Use a definition query (e.g. OBJECTID BETWEEN 1 AND 1000000) and split it into 17 chunks – FelixIP Dec 06 '16 at 19:55
  • Another idea: why not run the Calculate Field tool on your cate_2 field and move the if/then logic into the code block? – Hornbydd Dec 06 '16 at 19:58
  • @Hornbydd Calculate Field is frustratingly slow; it might take ages to complete on 17 million records – FelixIP Dec 06 '16 at 20:00
  • Sounds annoying, but I will give it a shot, FelixIP. Thanks. – user81784 Dec 06 '16 at 20:10
  • @FelixIP, Calculate Field may be slow, but I got the impression that speed is not the problem; it's that when cptpython runs their code, it never completes and fails. Your idea of chunking it up may be the best solution, but I thought I'd give Calculate Field a go to see if it completes, albeit more slowly than the other proposed methods. – Hornbydd Dec 07 '16 at 10:31
  • Python Generators anyone? – Tristan Forward Dec 12 '16 at 19:07

5 Answers

4

You could write the ObjectID and the result of the calculation (cate_2) to a CSV file, then join the CSV back to your original table and populate a field to preserve the result. That way you aren't updating the table with a da.UpdateCursor; a SearchCursor is enough.
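
For reference, here is a rough sketch of that join-back step (not part of the original answer); the paths and column names are assumptions taken from the question's Update #2, and the semicolon-delimited CSV may need a schema.ini (or could simply be written comma-delimited) for the import to parse correctly:

import arcpy

gdb = "D:/mosaic.gdb"                  # assumed from the question
results_csv = "D:/combine.csv"         # OID;CATEGORY pairs written by the SearchCursor

# 1) Import the CSV into the geodatabase so it gets proper field types.
arcpy.TableToTable_conversion(results_csv, gdb, "combo_results")

# 2) Index the key column of the imported table to speed up the join
#    (assumed to still be called OID after the import).
arcpy.AddIndex_management(gdb + "/combo_results", "OID", "oid_idx")

# 3) Permanently join the CATEGORY field onto the original table.
arcpy.JoinField_management(gdb + "/combo_table", "OBJECTID",
                           gdb + "/combo_results", "OID", ["CATEGORY"])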

klewis
  • I was thinking the same thing as there is a discussion here and they are talking about even larger datasets. – Hornbydd Dec 06 '16 at 19:57
  • Thanks, klewis. This sounds promising. I will try it along with FelixIP's suggestion; it's an interesting discussion, though I'll have to run this a few dozen times. – user81784 Dec 06 '16 at 20:38
  • Worked brilliantly! I've updated the question with latest script. Thanks! – user81784 Dec 08 '16 at 02:40
2

The updated code in section #2 of your question doesn't show how you are joining the .csv file back to the original table in your file geodatabase. You say that your script took ~5 min to run. This sounds fair if you have only exported the .csv file without doing any joins. When you try to bring the .csv file back into ArcGIS, you will hit performance issues.

1) You cannot do joins directly from a .csv to a geodatabase table, because the .csv file doesn't have an OID (having a field calculated with unique values won't help, as you will still need to convert your .csv file into a geodatabase table). So, several minutes go to the Table To Table GP tool (you could use the in_memory workspace to create a temp table there, which will be slightly faster).

2) After you've loaded the .csv into a geodatabase table, you would want to build an index on the field you will join on (in your case, the source OBJECTID value from the .csv file). This will take some minutes on a 16-million-row table.

3) Then you would need to use either the Add Join or the Join Field GP tool. Neither will perform well on your large tables.

4) Afterwards, you need to run the Calculate Field GP tool to calculate the newly joined field(s). Many minutes go here; what's more, field calculation takes longer when the fields that participate in the calculation come from a joined table.

In a word, you won't get anything close to the ~5 min you mention. If you make it in an hour, I will be impressed.

To avoid dealing with the processing of large datasets within ArcGIS, I suggest taking your data outside of ArcGIS into a pandas data frame and doing all your calculations there. When you are done, just write the data frame rows back into a new geodatabase table with da.InsertCursor (or you could truncate your existing table and write your rows back into the source one).

The complete code I've written to benchmark this is below:

import time
from functools import wraps
import arcpy
import pandas as pd

def report_time(func):
    '''Decorator reporting the execution time'''
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__, round(end-start,3))
        return result
    return wrapper

#----------------------------------------------------------------------
@report_time
def make_df(in_table,limit):
    columns = [f.name for f in arcpy.ListFields(in_table) if f.name != 'OBJECTID']
    cur = arcpy.da.SearchCursor(in_table,columns,'OBJECTID < {}'.format(limit))
    rows = (row for row in cur)
    df = pd.DataFrame(rows,columns=columns)
    return df

#----------------------------------------------------------------------
@report_time
def calculate_field(df):
    df.ix[(df['DataField2'] % 2 == 0), 'Category'] = 'two'
    df.ix[(df['DataField2'] % 4 == 0), 'Category'] = 'four'
    df.ix[(df['DataField2'] % 5 == 0), 'Category'] = 'five'
    df.ix[(df['DataField2'] % 10 == 0), 'Category'] = 'ten'
    df['Category'].fillna('other', inplace=True)
    return df

#----------------------------------------------------------------------
@report_time
def save_gdb_table(df,out_table):
    rows_to_write = [tuple(r[1:]) for r in df.itertuples()]
    with arcpy.da.InsertCursor(out_table,df.columns) as ins_cur:
        for row in rows_to_write:
            ins_cur.insertRow(row)

#run for tables of various sizes
for limit in [100000,500000,1000000,5000000,15000000]:
    print '{:,}'.format(limit).center(50,'-')

    in_table = r'C:\ArcGIS\scratch.gdb\BigTraffic'
    out_table = r'C:\ArcGIS\scratch.gdb\BigTrafficUpdated'
    if arcpy.Exists(out_table):
        arcpy.TruncateTable_management(out_table)

    df = make_df(in_table,limit=limit)
    df = calculate_field(df)
    save_gdb_table(df, out_table)
    print

Below is the output from the Debug I/O window (the number in each header is the number of rows in the table used), with the execution times of the individual functions:

---------------------100,000----------------------
('make_df', 1.141)
('calculate_field', 0.042)
('save_gdb_table', 1.788)

---------------------500,000----------------------
('make_df', 4.733)
('calculate_field', 0.197)
('save_gdb_table', 8.84)

--------------------1,000,000---------------------
('make_df', 9.315)
('calculate_field', 0.392)
('save_gdb_table', 17.605)

--------------------5,000,000---------------------
('make_df', 45.371)
('calculate_field', 1.903)
('save_gdb_table', 90.797)

--------------------15,000,000--------------------
('make_df', 136.935)
('calculate_field', 5.551)
('save_gdb_table', 275.176)

Inserting a row with da.InsertCursor takes roughly constant time per row; that is, if inserting 1 row takes, say, 0.1 seconds, inserting 100 rows will take 10 seconds. Sadly, 95%+ of the total execution time is spent reading the geodatabase table and then inserting the rows back into the geodatabase.

The same applies to building the pandas data frame from the da.SearchCursor generator and to calculating the field(s): as the number of rows in your source geodatabase table doubles, so does the execution time of the script above. Of course, you still need to use 64-bit Python, since some fairly large data structures will be held in memory during execution.

Alex Tereshenkov
  • Actually, I was going to ask another question about the limitations of the method I used, because I ran into the problems you've addressed above, so thanks! What I am trying to accomplish: combine four rasters, then perform an if-else statement based on the columns, write the outputs into a new column, and finally run Lookup to create a raster based on the values in the new column. My method had many unnecessary steps and an inefficient workflow; I should have mentioned this in my original question. Live and learn. I will try your script later this week though. – user81784 Dec 11 '16 at 20:27
2

Apologies if I keep reviving this old thread. The idea was to perform the if-else statements on the combined raster and then use the new field in Lookup to create a new raster. I complicated the problem by exporting the data as a table, which introduced the inefficient workflow that was addressed by @Alex Tereshenkov. After realizing the obvious, I batched the data into 17 queries (1 million records each), as suggested by @FelixIP. Each batch took on average ~1.5 minutes to complete, and the total run time was ~23.3 minutes. This method eliminates the need for joins, and I think it best accomplishes the task. Here's a revised script for future reference:

import arcpy, time
from arcpy import env

def cursor():
    combine = "D:/mosaic.gdb/combine_2013"
    #arcpy.AddField_management(combine,"cat_1","SHORT")
    fields = ['wat_agg', 'dev_agg', 'herb_agg','forest_agg', 'cat_1']
    # Build the 17 OBJECTID range queries (1 to 1,000,000, ..., 16,000,001 to 16,757,856)
    total, step = 16757856, 1000000
    batch = ['"OBJECTID" >= {0} AND "OBJECTID" <= {1}'.format(i + 1, min(i + step, total))
             for i in range(0, total, step)]
    for i in batch:
        start_time = time.time()
        with arcpy.da.UpdateCursor(combine, fields, i) as cursor:
            for row in cursor:
            # row's 0,1,2,3,4 = water, dev, herb, forest, category
            # classification: water = 1; dev = 2; herb = 3; forest = 4
                if (row[0] >= 0 and row[0] >= row[3]):
                    row[4] = 1
                elif (row[1] > 1):
                    row[4] = 2
                elif (row[2] > 180):
                    row[4] = 3
                elif (row[3] >= 0 and row[3] > row[0]):
                    row[4] = 4
                cursor.updateRow(row)
        end_time =  time.time() - start_time
        print str(end_time) + " - Seconds"

cursor()
user81784
  • So, just to make sure I'm understanding this correctly. In your original post you said that when you ran this on a computer with 40GB RAM, it took ~16 hours total. But now that you split it up into 17 batches, and it took ~23 minutes total. Is that correct? – ianbroad Dec 13 '16 at 08:50
  • Correct. The first run took ~ 16 hours with 40 GB RAM and second run took ~23 minutes + another ~15 minutes to perform Lookup and export the raster with newly defined categories. – user81784 Dec 13 '16 at 13:29
  • 1
    Just a note that arcpy.env.parallelProcessingFactor = "100%" has no effect on your script. I don't see any tools in there that leverage that environment. – KHibma Dec 14 '16 at 14:30
  • You're correct. I will edit the code. – user81784 Dec 14 '16 at 15:04
1

You could try switching to CalculateField_management. This avoids looping through with cursors and, by the looks of your options for the category value, you could set it up as four subprocesses spawned sequentially. As each subprocess finishes, its memory is released before the next one starts. You take a small (milliseconds) hit spawning each subprocess.
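
For illustration only, here is a minimal single-call sketch of that route (the field names are taken from the question, not from this answer); the sequentially spawned subprocesses suggested above would each wrap a call like this for their portion of the data:

import arcpy

table = "D:/mosaic.gdb/combo_table"   # path assumed from the question

# The if/else logic moves into the tool's code block, so no cursor loop is
# needed in the calling script.
code_block = """
def classify(wat, dev, herb, forest):
    # water = 1; dev = 2; herb = 3; forest = 4
    if wat >= 0 and wat > forest:
        return 1
    elif dev > 1:
        return 2
    elif herb > 180:
        return 3
    elif forest >= 0 and forest > wat:
        return 4
"""

arcpy.CalculateField_management(
    table, "cate_2",
    "classify(!wat_agg!, !dev_agg!, !herb_agg!, !forest_agg!)",
    "PYTHON_9.3", code_block)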

Or, if you want to keep your current approach, have a subprocess that takes x rows at a time. Have a main process drive it and, as before, you keep scavenging your memory each time a subprocess finishes. The bonus of doing it this way (especially through a stand-alone Python process) is that you can make more use of all your cores, since spawning subprocesses gets you around the GIL. This is possible with ArcPy, and it's an approach I've used in the past for massive data churns. Obviously keep your chunks of data small, otherwise you'll run out of memory even faster!
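
A minimal sketch of the chunked-subprocess idea (not this answer's own code): the script re-launches itself for one OBJECTID range at a time, so each chunk's memory is returned to the OS when the child process exits. The paths, field names, and record count are assumptions taken from the question.

import subprocess
import sys

TABLE = "D:/mosaic.gdb/combo_table"
FIELDS = ['wat_agg', 'dev_agg', 'herb_agg', 'forest_agg', 'cate_2']
CHUNK = 1000000
TOTAL = 16757856   # record count mentioned in the question

def process_chunk(lo, hi):
    """Runs inside a child process: classify one OBJECTID range."""
    import arcpy
    where = '"OBJECTID" >= {0} AND "OBJECTID" <= {1}'.format(lo, hi)
    with arcpy.da.UpdateCursor(TABLE, FIELDS, where) as cursor:
        for row in cursor:
            # water = 1; dev = 2; herb = 3; forest = 4
            if row[0] >= 0 and row[0] > row[3]:
                row[4] = 1
            elif row[1] > 1:
                row[4] = 2
            elif row[2] > 180:
                row[4] = 3
            elif row[3] >= 0 and row[3] > row[0]:
                row[4] = 4
            cursor.updateRow(row)

if __name__ == '__main__':
    if len(sys.argv) == 3:            # child: process a single chunk
        process_chunk(int(sys.argv[1]), int(sys.argv[2]))
    else:                             # parent: spawn one child per chunk, sequentially
        for lo in range(1, TOTAL + 1, CHUNK):
            hi = min(lo + CHUNK - 1, TOTAL)
            subprocess.check_call([sys.executable, sys.argv[0], str(lo), str(hi)])

Because each chunk runs in a fresh process, its memory goes back to the OS when the child exits, regardless of what arcpy holds onto internally.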

MappaGnosis
  • In my experience, using arcpy.da.UpdateCursor is way faster than arcpy.CalculateField_management. I wrote a script that runs on 55,000,000 building features; it was about 5 times slower with the CalculateField tool. – offermann Dec 07 '16 at 08:52
  • The point is to set up four subprocesses and scavenge the memory, as that is the real pinch point here. As I outline in the second paragraph, you could split the subprocesses by rows, but that takes a little more management than a single select. – MappaGnosis Dec 07 '16 at 08:55
1

The data manipulation logic can be written as an UPDATE SQL statement using a CASE expression, which you can execute using GDAL/OGR, e.g. via OSGeo4W with gdal-filegdb installed.

Here is the workflow, which uses osgeo.ogr instead of arcpy:

import time
from osgeo import ogr

ds = ogr.Open('D:/mosaic.gdb', 1)
if ds is None:
    raise ValueError("You don't have a 'FileGDB' driver, or the dataset doesn't exist")
sql = '''\
UPDATE combo_table SET cate_2 = CASE
    WHEN wat_agg >= 0 AND wat_agg > forest_agg THEN 1
    WHEN dev_agg > 1 THEN 2
    WHEN herb_agg > 180 THEN 3
    WHEN forest_agg >= 0 AND forest_agg > wat_agg THEN 4
    END
'''
start_time = time.time()
ds.ExecuteSQL(sql, dialect='sqlite')
ds = None  # save, close
end_time =  time.time() - start_time
print("that took %.1f seconds" % end_time)

On a similar table with just over 1 million records, this query took 18 mins. So it might still take ~4 to 5 hours to process 16 million records.

Mike T
  • Unfortunately the script is part of a larger script written using arcpy but I appreciate the answer. I am slowly trying to use GDAL more. – user81784 Dec 08 '16 at 15:13