Gisky: geeky GIS: Don't use zonal statistics as a table in ArcGIS

Wednesday, March 7, 2018

Don't use zonal statistics as a table in ArcGIS

Short version: Don't ever use zonal statistics as a table in ArcGIS. It manages to get even simple calculations, such as an average, wrong.

Example of using zonal statistics between grid cells (from the ArcGIS manual)

Long version: I was hired by a customer to determine why they got unexpected results for their analyses. These analyses led to an official map with legal consequences. After investigating their whole procedure, I found a number of issues. But the major source of errors was one I had considered very unlikely: it turns out that the algorithm used by ArcGIS Spatial Analyst to determine the average grid value in a shape is wrong. Not just a little wrong. Very wrong. And no, I am not talking about the no-data handling, which was very wrong as well; I'm talking about how the algorithm compares vectors and rasters. Interestingly, this seems to be known, as the ArcGIS manual states:

It is recommended to only use rasters as the zone input, as it offers you greater control over the vector-to-raster conversion. This will help ensure you consistently get the expected results.

So how does ArcGIS compare vectors and rasters? One could think of a number of algorithms:

Use the centers of the pixels and compare those to the vectors (most frequently used and fastest).
Use the actual area of the pixels that is covered by the vector.
Use those pixels of which the majority of the area is covered by the vector.
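
To make the difference between these three options concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from ArcGIS or any other GIS package: the function names, the unit cell size, the grid origin at (0, 0) and the approximation of covered area by sample points are all assumptions made for the example.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def coverage(row, col, poly, n=10):
    """Approximate fraction of cell (row, col) covered by poly.

    Assumes unit cells with the grid origin at (0, 0) and row 0 at the
    bottom; the area is estimated from n * n sample points per cell.
    """
    hits = 0
    for i in range(n):
        for j in range(n):
            x = col + (j + 0.5) / n
            y = row + (i + 0.5) / n
            hits += point_in_polygon(x, y, poly)
    return hits / (n * n)


def _mean(vals):
    return sum(vals) / len(vals) if vals else None


def zonal_means(grid, poly, n=10):
    """Zonal mean of grid inside poly under the three candidate rules."""
    center_vals, majority_vals = [], []
    weight_sum = 0.0
    weighted_value_sum = 0.0
    for r, row_vals in enumerate(grid):
        for c, v in enumerate(row_vals):
            # Rule 1: the cell counts if its center lies inside the polygon.
            if point_in_polygon(c + 0.5, r + 0.5, poly):
                center_vals.append(v)
            frac = coverage(r, c, poly, n)
            # Rule 3: the cell counts if more than half of it is covered.
            if frac > 0.5:
                majority_vals.append(v)
            # Rule 2: every cell counts in proportion to its covered area.
            weight_sum += frac
            weighted_value_sum += frac * v
    return {
        "center": _mean(center_vals),
        "area_weighted": weighted_value_sum / weight_sum if weight_sum else None,
        "majority": _mean(majority_vals),
    }


if __name__ == "__main__":
    # Hypothetical 4 x 4 grid and a triangular zone, purely for illustration.
    grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    triangle = [(0, 0), (4, 0), (0, 4)]
    print(zonal_means(grid, triangle))
```

The three rules generally give slightly different means for the same shape, but each of them only ever uses pixels that touch the shape.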

None of these algorithms matches the results we saw from ArcGIS, even though the documentation seems to suggest the first method is used. So what is happening? It seems that ArcGIS first converts your vector file to a raster, not necessarily in the same grid system as the grid you compare to. Then it interpolates your own grid (using an undocumented method) and takes the average of those interpolated values wherever their cells match the rasterized zone. This means pixels outside your shape can influence the result. This mainly seems to occur when large areas are mapped (e.g. Belgium at 5 m).
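
The effect of rasterizing the zone on a different grid can be shown with a one-dimensional toy example. To be clear, this is not the ArcGIS algorithm, which is undocumented; it is only a sketch of how a zone raster that does not share the value grid can pull in pixels from outside the shape. The function names, the half-open zone interval and the cell sizes are assumptions made for the illustration.

```python
def mean(vals):
    return sum(vals) / len(vals)


def direct_mean(values, cell, zone):
    """Mean of the value cells whose center lies in the half-open interval zone."""
    lo, hi = zone
    return mean([v for i, v in enumerate(values) if lo <= (i + 0.5) * cell < hi])


def mean_via_zone_raster(values, cell, zone, zone_cell):
    """Mean after first rasterizing the zone on its own grid of size zone_cell."""
    lo, hi = zone
    n_zone = int(len(values) * cell / zone_cell)
    # Step 1: rasterize the zone on its own grid, using the cell-center rule.
    zone_mask = [lo <= (k + 0.5) * zone_cell < hi for k in range(n_zone)]
    # Step 2: a value cell counts if its center falls in a flagged zone cell.
    picked = []
    for i, v in enumerate(values):
        k = int((i + 0.5) * cell // zone_cell)
        if k < n_zone and zone_mask[k]:
            picked.append(v)
    return mean(picked)


# Hypothetical data: value cells of size 1, zone covering [2.0, 5.4).
values = [0, 0, 10, 10, 10, 0, 0, 0]
zone = (2.0, 5.4)

same_grid = direct_mean(values, 1, zone)
other_grid = mean_via_zone_raster(values, 1, zone, zone_cell=2)
print(same_grid, other_grid)
```

With the zone rasterized at cell size 2, the zone raster covers [2, 6), so the value cell centered at 5.5, which lies outside the shape, is averaged in and drags the mean down.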

The average of this triangle is calculated by ArcGIS as 5.47

I don't understand how the market leader in GIS can get such a basic operation so wrong, and the whole search also convinced me of how important it is to open the source (or at least the algorithm used) to get reproducible results. Anyway, if you are still stuck with ArcGIS, know that you can install SAGA GIS as a toolbox. It contains sensible algorithms to do a vector/raster intersection, and they are also approximately 10 times faster than the ArcGIS versions. Or you can have a look at how GRASS and QGIS implement this. All of this, of course, only if you consistently want to get the expected results...

And if your government also uses ArcGIS for determining taxes or other policies, perhaps they too should consider switching to a product which consistently gives the expected results.

Update March 18 2018: make sure you check out the comments from Steve Kopp (spatial analyst development team) below and the discussion - it is interesting.

12 comments:

Fiona, March 8, 2018 at 7:21 AM
For heaven's sake, I use this tool all the time....What version of ArcGIS were they using?
Johan Van de Wauw, March 8, 2018 at 7:37 AM
The error is present in versions 10.1 and 10.3, and I could also reproduce it in a recent trial of ArcGIS Pro.
I guess it is still present in more recent releases.

Note that if you click on the environments button on the bottom of the dialog, you can set at which grid the interpolation should happen, which may give you better results.

But I find that bad practice: there is no valid reason not to use the grid system of the only grid used in the analysis.

https://community.esri.com/thread/207069-zonal-statistics-as-table-issue
Fiona, March 8, 2018 at 8:21 AM
Thanks for the alert. This explains some discrepancies I have noticed in my analyses, like totals calculated from means of zones not adding up to the same as the total area.
Unknown, March 13, 2018 at 10:57 AM
That boosts my confidence to inch closer to open source GIS.
Jonas Hamberg, August 30, 2018 at 12:37 PM
Thanks for the heads up! I used the SAGA toolbox with weighted areas instead, much more accurate for my analysis.