Wednesday, March 7, 2018

Don't use zonal statistics as a table in ArcGIS

Short version: Don't ever use zonal statistics as a table in arcgis. It manages to do simple calculations such as calculating an average wrong.

Example of using zonal statistics between grid cells (from the ArcGIS manual)

Long version: I was hired by a customer to determine why they got unexpected results for their analyses. These analyses led to an official map with legal consequences. After investigating their whole procedure a number of issues were found. But the major source of errors was one which I found very unlikely: it turns out that the algorithm used by ArcGIS spatial analyst to determine the average grid value in a shape is wrong. Not just a little wrong. Very wrong. And no, I am not talking about the no data handling which was very wrong as well, I'm talking about how the algorithm compares vectors and rasters Interestingly, this seems to be known, as the arcgis manual states
It is recommended to only use rasters as the zone input, as it offers you greater control over the vector-to-raster conversion. This will help ensure you consistently get the expected results.
So how does arcgis compare vectors and rasters? In fact one could invent a number of algorithms:

  • Use the centers of the pixels and compare those to the vectors (most frequently used and fastest).
  • Use the actual area of the pixels 
  • Use those pixels of which the majority of the area is covered by the vector. 

None of these algorithm matches with the results we saw from arcgis, even though the documentation seems to suggest the first method is used. So what is happening? It seems that arcgis first converts your vector file to a raster, not necessarily in the same grid system as the grid you compare to. Then it interpolates your own grid (using an undocumented method) and then takes the average of those interpolated values if their cells match with the raster you supplied. This means pixels outside your shape can have an influence on the result. This mainly seems to occur when large areas are mapped (eg Belgium at 5m).

The average of this triangle is calculated by ArcGIS as 5.47

I don't understand how the market leader in GIS can do such a basic operation so wrong, and the whole search also convinced me how important it is to open the source (or at least the algorithm used) to get reproducible results. Anyway, if you are still stuck with arcgis, know that you can install SAGA GIS as a toolbox. It contains sensible algorithms to do a vector/raster intersection and they are also approximately 10 times faster than the ArcGIS versions. Or you can have a look at how Grass and QGIS implement this.  All of this of course only if you consistently want to get the expected results...

And if your government also uses ArcGIS for determining taxes or other policies, perhaps they too should consider switching to a product which consistently gives the expected results.

Update March 18 2018: make sure you check out the comments from Steve Kopp (spatial analyst development team) below and the discussion - it is interesting.