
I am creating IDF curves for work using Python and 30-minute rainfall-rate rasters. There are 24 rasters per day and about two decades' worth of data. I basically input lat/lon coordinates, and my Python script grabs the pixel value from each raster throughout the period and stores it in a DataFrame for statistical analysis. This currently takes about 5 hours to run per lat/lon coordinate, with the majority of the time spent opening each raster and extracting the pixel value. Are there any ways to speed this up? I am completely stuck and can't find whether GDAL has any built-in speed optimizations. I've included the function I use for extracting the pixel value below. Please let me know if you have any questions or suggestions, thanks!

from osgeo import gdal


def extract_point_values_from_raster(raster_filepath, lat_lon_str):
    """Return the raster value at a 'lat,lon' point, or None if unavailable."""
    lat_lon = lat_lon_str.split(',')
    lat = float(lat_lon[0])
    lon = float(lat_lon[1])

    raster_val = None

    try:
        src_ds = gdal.Open(raster_filepath)
        gt = src_ds.GetGeoTransform()
        rb = src_ds.GetRasterBand(1)

        # Raster extent from the geotransform and raster size
        # (gt[2] and gt[4] are rotation terms, not extents).
        x_min = gt[0]
        x_max = gt[0] + gt[1] * src_ds.RasterXSize
        y_max = gt[3]
        y_min = gt[3] + gt[5] * src_ds.RasterYSize

        if y_min < lat < y_max and x_min < lon < x_max:
            # Convert the geographic coordinates to pixel offsets.
            px = int((lon - gt[0]) / gt[1])
            py = int((lat - gt[3]) / gt[5])

            intval = rb.ReadAsArray(px, py, 1, 1)

            if intval is not None:
                raster_val = round(float(intval[0][0]), 2)

        src_ds = None  # close the dataset

    except Exception:
        raster_val = None

    return raster_val

JWB
  • Are you doing this for multiple coordinates? One approach would be to sample all of the coordinates from the raster while you have it open. I can post a numpy solution that might be helpful if that is the case. – Marcelo Villa Mar 25 '20 at 21:11
  • Thank you @MarceloVilla, but currently I am only analyzing one coordinate at a time! – JWB Mar 25 '20 at 21:36
  • You could look into: https://gis.stackexchange.com/questions/358036/extracting-data-from-a-raster, but I am not sure whether it would be faster or not. – snowman2 Apr 25 '20 at 04:17
  • Did you flush the cache? Might free up memory. – wingnut Apr 29 '21 at 07:10
  • Is the source dataset netCDF related? – huckfinn Dec 28 '21 at 01:21

1 Answer


Some of my initial questions/thoughts/things to experiment with:

  • Every time you call the function it re-opens the raster and then performs an intersection check. Presumably there may be multiple lat/lon queries against a given raster.

    • Can you pre-index all of the raster bounds (e.g. with an rtree), then query all of a raster's intersecting points at once? This index could be built once and then saved for later queries.
    • Or, can you lazily open a raster only when needed, but then keep it open in case it's needed again later? (See the first sketch after this list.)
  • Since it seems to be I/O-bound, make sure the rasters are located on the fastest storage you have access to, ideally local solid-state drives.

  • The raster format can have an impact on performance. For example, if a raster is interleaved by line, I believe GDAL has to read the entire line to retrieve your pixel of interest. If you're repeatedly reading this dataset, consider trying:

    • Converting to an internally-tiled format, if not already
    • Stacking so that you have one 24-band raster per day and experimenting to find the best interleave (BIP, perhaps?) (see the second sketch after this list)
  • Have you profiled the code to ensure that it's slowest where you think it is, and that there are no other significant bottlenecks? A minimal profiling sketch follows the list as well.
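
If all of the rasters share the same grid (same extent, resolution, and CRS), the geotransform arithmetic only has to happen once. Here is a minimal sketch of that idea; the function name and the shared-grid assumption are mine, not part of your original code:

from osgeo import gdal

def sample_point_across_rasters(raster_filepaths, lon, lat):
    """Read the pixel containing (lon, lat) from each raster in turn.

    Assumes every raster shares the same grid, so the pixel offset is
    computed once from the first file and reused for all the others.
    """
    values = []
    px = py = None

    for filepath in raster_filepaths:
        ds = gdal.Open(filepath)
        if ds is None:
            values.append(None)
            continue

        if px is None:
            gt = ds.GetGeoTransform()
            px = int((lon - gt[0]) / gt[1])
            py = int((lat - gt[3]) / gt[5])

        arr = ds.GetRasterBand(1).ReadAsArray(px, py, 1, 1)
        values.append(round(float(arr[0][0]), 2) if arr is not None else None)

        ds = None  # release the file handle promptly

    return values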
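
For the format experiments, here is one way to build an internally-tiled, pixel-interleaved daily stack with GDAL's Python bindings. The file names are hypothetical and the creation options are just a starting point to benchmark:

from osgeo import gdal

# Hypothetical list of one day's 30-minute rasters.
halfhourly_files = ["rain_20000101_0000.tif", "rain_20000101_0030.tif"]  # ..., 24 files total

# Stack the day's files into one multi-band virtual raster
# (separate=True puts each input file into its own band).
vrt = gdal.BuildVRT("day_stack.vrt", halfhourly_files, separate=True)

# Materialize it as a tiled, pixel-interleaved (BIP) GeoTIFF so that a
# single-pixel read only touches one small block of the file.
gdal.Translate(
    "day_stack.tif",
    vrt,
    format="GTiff",
    creationOptions=["TILED=YES", "INTERLEAVE=PIXEL", "COMPRESS=DEFLATE"],
)
vrt = None  # flush and close the VRT

With a 24-band file per day, a single dataset-level ReadAsArray(px, py, 1, 1) call returns the point's value for every band at once, so you open roughly 1/24th as many files.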
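
And on the profiling point, a minimal cProfile sketch; run_extraction() is a stand-in for whatever driver function loops over your raster files:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_extraction()  # hypothetical driver that loops over the rasters
profiler.disable()

# Print the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)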

mikewatt
  • Thanks for the reply! So I'm attempting to read through about 350,000 different raster files and getting the value at the same lat/lon from each file; I never use the same raster twice. This is also where the problem comes in with being unable to store the rasters on my fastest storage: I just don't have the space on that drive, so they're saved on a Synology NAS. I have profiled the code, though, and this is the spot taking the most time! – JWB Mar 25 '20 at 19:34
  • I see. For any given point query, will there be valid data in every one of those 350,000 rasters, or only a subset of them? In other words, does every file cover the exact same area? – mikewatt Mar 26 '20 at 18:35