How can I determine which raster attribute covers the highest area of a polygon?

Question

I'm relatively new to GIS and QGIS but I have a problem that I'm not certain how to solve. I searched here for relevant information but found none. (Or, I used incorrect search terms.) My problem is thus:

I have a raster file of crop data determined from satellite imagery. Pixel size is 30 meter x 30 meter.

I have a shapefile with over 270,000 polygons of varying sizes.

I can get general 'zonal statistics' through QGIS but these are relatively meaningless: count, mean, and sum don't help me in determining the predominant--by area--crop type within a given polygon.

Ideally, I'd like to be able to determine which crop type covers the highest area in every polygon. As an example, in the image below, the parcel in the center of the image has at least six crop types within it's boundaries. However, clearly there is one predominant crop type. How can I associate that single crop type with that parcel? Is this possible?

An example image

I'm using QGIS version 2.4.0 on Windows 7.

If there is an R solution, I would enjoy hearing that, too.

Wouldn't your zonal statistics mean be the predominant cell value for each polygon? Is there a source attribute (polyID) in the output table to relate the statistics back (via join) and visually compare the results against expectations. — Michael Stimson, Nov 06 '14 at 01:52
Upon reflection the mean of classified data is meaningless (no pun intended) you might find some relief or solace in this question answered by Curlew http://gis.stackexchange.com/questions/59238/is-there-a-zonal-statistics-tool-to-calculate-median-and-mode-raster-values-for . it also references the R method for doing the same. — Michael Stimson, Nov 06 '14 at 02:46
Yes, the mean would be pretty worthless. I considered it, but given that a crop type of, for example, 150, and a crop type of 1 would provide a mean crop type of (150+1)/2 which is something else entirely.
I think the mode would work, but only if all pixels of the same crop type are labeled individually rather than merged together.

I've looked into an R solution for this problem, namely (I think; my code is on a different computer) with the overlay() function in the raster package. I think that has promise, but for some reason my raster was being read in filled with NA values. — Steven, Nov 06 '14 at 14:20
Do you have access to ArcGis with Spatial Analyst? http://resources.arcgis.com/en/help/main/10.2/index.html#//009z000000w8000000 offers a Majority statistic. — Michael Stimson, Nov 06 '14 at 22:29
I don't. I prefer using open-source software but if I can't find a workable solution may download the trial version of Arc. Thanks for your comments though! — Steven, Nov 07 '14 at 19:14
Are you perhaps using NASS crop data. If so, you do not have 255 crop types and you need to apply some pre-processing to your data before using it. This is a standard 8-bit convention that NRCS uses but there are only certain values assigned crop types, the rest are place holder values that are meaningless. Please look at the NASS faq. Here is a download for dbf file the attribute definitions that can be related to the raster values: www.nass.usda.gov/research/Cropland/docs/generic_cdl_attributes.tif.vat.dbf.zip — Jeffrey Evans, Nov 07 '14 at 22:25

score 2 · Answer 1 · edited Nov 07 '14 at 21:53

Here is an R solution that uses the mode of the observed distribution.

Create some example data

require(raster)
  r <- raster(ncol=36, nrow=18)
    r[] <- sample(1:3, ncell(r), replace=TRUE)
      polys <- SpatialPolygons(list(Polygons(list(Polygon(rbind(c(-180,-20),c(-160,5),
                 c(-60, 0),c(-160,-60),c(-180,-20)))), 1), 
                 Polygons(list(Polygon(rbind(c(80,0),c(100,60),c(120,0),
                 c(120,-55),c(80,0)))), 2)))
    polys <- SpatialPolygonsDataFrame(polys, data.frame(row.names=c("1","2"),
                                      ID=1:2))
plot(r)
  plot(polys, col="red", add=TRUE)

Function to calculate peak mode of distribution. This could alternatively, be done by fitting a spline to x. The density approach is just a nice shortcut.

dmode <- function(x) {
  den <- density(x, kernel=c("gaussian"))
    ( den$x[den$y==max(den$y)] )   
}

Here we extract values for all polygons and calculate mode value using the dmode function. Note; the base R function "mode" returns the storage type of the object class (e.g., numeric, character) and not the distributional mode. Since the mode is a function of the peak distribution we need to round it back to a whole number so it corresponds to a "real" value in the observed distribution. If the raster data were floating point or you wanted a true hinge point, then this would not be necessary.

( ep <- extract(r, polys) )
  ep <- lapply(sp, function(x) x[!is.na(x)]) # remove NA values
    ( x <- round(unlist(lapply(ep, FUN=dmode)),digits=0) )

Since you have a large number of polygons you may want to implement a memory safe for loop that processes one polygon at a time. However, this may be slow.

x <- vector()
  for(i in 1:nrow(polys)){
    p <- polys[i,]
      v <- extract(r, p)
        v <- lapply(v, function(x) x[!is.na(x)]) 
          x <- append(x, round(unlist(lapply(v, FUN=dmode)),digits=0) )
    }
      cat("Mode(s):", x,  "\n")

You can then join the resulting "mode" values back to each polygon and plot results.

polys@data$MajClass <- x 
cols <- ifelse(polys@data$MajClass == 1, "red",
          ifelse(polys@data$MajClass == 2, "green",
            ifelse(polys@data$MajClass == 3, "blue", NA)))
  plot(r, legend=FALSE)
    plot(polys, col=cols, add=TRUE)
      legend("bottomright", legend=c(unique(x)), fill=cols)

I would add, another efficient way to calculate frequencies is the table function. Let's create a vector of 1,2,3, the table function will then return counts of each value in the vector.

x <- c(1,1,1,2,2,3,3,3,3,3)
 table(x)

We can then return the class name associated with the most frequent value.

names(which.max(table(x)))

In the context of the problem at hand, we can use lapply function to the extracted raster values.

( ep <- extract(r, polys) )
  ep <- lapply(sp, function(x) x[!is.na(x)])
    ( x <- unlist(lapply(ep, FUN=function(x) { as.numeric(names(which.max(table(x)))) })) )

An exact frequency "count" approach will give a more precise answer. Let's take an example where extreme values are existing in the same polygon and compare mode and count.

x=c(1,1,1,2,2,255,255,255,255,255)
  round(dmode(x),digits=0)
    names(which.max(table(x)))

score 1 · Accepted Answer · answered Dec 11 '14 at 16:18

I ended up using an R solution almost identical to that provided by @JeffreyEvans.

First, I was able to pare down the number of polygons of my shapefile I was genuinely interested in. I pared these data down by eliminating those polygons that were outside of city limits in the county in which I was interested. This was a simple process done in QGIS to intersect my parcel layer with a city limits layer.

I used the extract() function from the Raster package with the below function for mathematical Mode to extract the singular value for each polygon (of only ~22,000 rather than ~290,000):

  Mode <- function(x, na.rm = FALSE) {
    if(na.rm){
      x = subset(x, !is.na(x))
    }

    ux <- unique(x)
    return(ux[which.max(tabulate(match(x, ux)))])
  }

(I wrote this Mode() function to avoid the density() function returning values that were not integers.)

Then it was just an issue of applying the extract() function with my raster and SpatialPolygonsDataFrame, and Mode() function as values:

    ep <- as.data.frame(extract(myRaster, myShapefile, fun = Mode, na.rm = T))
    names(ep) <- c("cropValue")

then inserting the crop data back into my SpatialPolygonsDataFrame:

    myShapefile@data$cropValue <- ep[,1]

I timed how long the data processing took. Initially, I ran the extract function on only the first 10 and 100 rows of my data. For the original SpatialPolygonsDataFrame, I calculated that it would have taken > 100 hours to do these analyses. I contemplated running it on a super-fast virtual machine via AWS EC2 but was able to pare down the spatial data frame to < 10% of its initial size. Still, running the above function on only ~22,000 polygons still took > 6 hours on my moderately fast desktop.

Thanks for the tips and advice! I was unable to discover a QGIS solution but was intrigued enough to start learning Python to remedy that in the future. However, R did the job.

Thanks again.

dof1985 · Answer 3 · 2014-11-07T20:17:05.010

0

Update: I have thought of a simpler way. Yet it has its own risks ivolving rasterizing your polygons. The original suggestion is below this one.

Start with rasterizing your parcel polygon layer. Than use r.statsitics from the GRASS plugin in QGIS (documentation) to compute the mode of the crops raster in each one of the parcels raster (each of them must have a unique data within it, base on ID of the vector layer). After that you might either vectorize the result again, this time each raster has the attribute of its dominant (mode) crop, or you can use zonal statistics with the resulting (statistics) raster and your original polygon. For that second option, it would be easier to conclude from the mean, min and max what is the dominant crop within each polygon (if the rasterized polygons and the vector polygins will be perfectlly overlayed, than all statistics will be equal to the crop code).

This is a bit of a work around soltuion, since ArcGIS is out of hand for you. What you are looking acctually is the mode of the ratster attributes in each one of your polygons, e.g. land parcles.

You might use model builder or a script to loop the following steps:

For each crop type, between 1 to n; extract a new binary raster using the raster calculator, e.g. for crop number 1 (wheat for example) use th syntax crops@1=1.
Use Zonal statistics with each binary raster crop_1 to crop_n use zonal statistics to calculate the count of the crop within each polygon. You might prefer to call the column "crop1_cnt","crop2_cnt", etc.
Last, use the field calculator with a CASE...ELSE expressions to feed the dominant (or mode) crop for each polygon; literally for each row find the highest count of all crop counts and write the corresponding crop code to a new dominant_crop column.

It might be complex, but I couldn't find a tool that calculate the "mode" for zonal statistics in grass or qgis.

edited Nov 07 '14 at 20:17

answered Nov 07 '14 at 19:59

dof1985

3,146
20
32

A R soulution might be found here: http://gis.stackexchange.com/questions/63795/zonal-statistics-for-polygons-with-r; yet, a mode function should be written. – dof1985 Nov 07 '14 at 20:09
1

The mode is the peak value of a distribution regardless of it representing the central tendency of the distribution and makes very little sense in this regard. I believe that the question is, in fact, wanting the majority class associated with crop types in each polygon. – Jeffrey Evans Nov 07 '14 at 20:15
@JeffreyEvans you are right, however taking into account the crop categories are infact a nominal variable, the mode (or the most frequent category) within a polygon (i.e. parcel) will also represent the majority. – dof1985 Nov 07 '14 at 20:18
1

@JeffreyEvans, in the raster image, crop type is represented by a numeric value between 1 and 255. The mean is useless as discussed above, the median would be equally worthless, but the mode would represent which crop type is most frequent in the distribution of crop types in a given parcel.
You're right that the question is to determine the majority class within each polygon. The mathematical solution is simply the mode. However,the answer provided by dof1985 presents another solution that I'm about to try.
– Steven Nov 07 '14 at 20:36
@dof1985, I'm working on that R solution as well. The function for 'Mode' is the easy part. For some reason, my raster is being read into R with no data. Identifying why that's the case is the difficult part. Thanks for the answer you've provided; I'm going to try that now. – Steven Nov 07 '14 at 20:37
@Steven. Hope one of those will work for you. Pleas update with a solution, I'm curios...Good luck – dof1985 Nov 07 '14 at 20:38
So, you have 255 crop types? This seems to be representing something other than "type". What exactly happens if a given polygon exhibits a bimodal distribution? – Jeffrey Evans Nov 07 '14 at 20:48
I'll update with a solution after I've tried those presented here. I don't think it'll be by this evening, but I'll keep everyone updated.
@JeffreyEvans: Thanks for the R solution. I'm more comfortable in R than GIS. And yes, the raster data downloaded from CropScape (USDA) provides 255 different values for crops (rather, specific commodities) though upon closer examination, 123 of those 255 are 'blank.'

If a bimodal distribution, I'm in a tough spot, I guess. My client knows that the data from the USDA is messy and we may have to adjust accordingly.
– Steven Nov 07 '14 at 23:15
The "dmode" function that I provided in my example will not have an issue with a bimodal distributions, it will just pick the mode with the higher peak value. – Jeffrey Evans Nov 07 '14 at 23:19
Interesting fact: @JeffreyEvans, you and I have played Irish music together in Laramie! I'm a beginner guitar player and sat in on some sessions with you on Front Street!
This, however, is unrelated to the topic at hand!
– Steven Nov 07 '14 at 23:22
@Steven, the raster you downloaded has values between 0 - 255, since it had an 8-bit radiometric resolution, namely gray levels (or intensity of refelctance caught by satellite sensor) can be represented by a number between 0 (low intensity) to 255 (high intensity). However, as you have already observed, it doesn't necessarly mean that it uses all the radiometric range. In addition, you might prefer to use categorial symbology on a min-max strech for crop types data. – dof1985 Nov 08 '14 at 17:17

How can I determine which raster attribute covers the highest area of a polygon?

3 Answers3

Linked