Today I committed changes to QGIS master that I started working on at the Lisbon hackfest and finalised this week (along with a unit test yay!). The changes implement improvements to the way in which statistics are gathered for rasters. First a little history: the raster stats code in QGIS is some of the earliest work I have done in QGIS. The code was also hugely inefficient – basically it completely scans every pixel in the file twice! It does this because it extracts minima, maxima, stddev etc. from each band in the raster – and the algorithm used to calculate the stddev requires first calculating the sum of means, hence the second pass. Of course we use the underlying GDAL library to traverse the raster, but we completly bypassed the GDAL stats gathering functions. Now from my testing (see below) the difference of gathering stats in QGIS as opposed to directly in GDAL was not that great. However, the real killer is the fact that QGIS had no reasonable implementation to cache these statistics – either within a session or between sessions. This resulted in quite a lot of complaints about the slow loading of rasters in QGIS and the spontaneous recollection of stats when making various changes to raster symbology etc.
With the release of 1.7, Radim Blazek implemented ‘proper’ support for raster providers so that they follow a similar model to the vector providers in QGIS. This opens up room for provider specific optitmisations and for implementing new providers in the future e.g. for PostGIS Raster, Terralib etc. At the Lisbon hackfest, I refactored the provider interface so that it provides a default implementation of stats gathering which can be overloaded by specific providers, each then being able to use best practice for that datasource when gathering stats. I then updated the GDAL Provider to use gdal native calls to fetch statistics.
When GDAl gathers stats, it stores them in a file named after the source raster but with a .aux.xml extension. As a bonus, GDAL also stores histogram data in this file. Thanks to some excellent tips from Etienne (Sky?) on the QGIS mailing list my implementation was further improved so that it actually uses these cached stats from GDAL properly. So let’s see how performance is improved with these changes. I tested using a ~650mb tif hillshaded DEM from one of my clients.
Added bonus: My changes also use a more efficient approach for obtaining min/max for a band (before all stats are gathered) and also histgram calculations are cached now. So after the initial gathering of stats and histograms, all activities when working with GDAL rasters should be near instant now. By the way the 1 in the graph above is rounded up – in my usage it was near instant in most cases.
I need to still evaluate whether it will be good to backport these changes to the 1.7 stable tree, so for now if you want to benefit from these changes, you need to be using a nightly build or build from source.
Every 6 months we hold a hackfest / QGIS meeting and many of our users generously donate towards it. I hope the above article gives those who do support our meetings a sense of the tangible benefits that arise from these meetings – there have been many other improvements coming into master that arose from the last hackfest that we all get to enjoy! If you would like to help towards the costs of the next hackfest, visit http://qgis.org and click the ‘Donate!’ button on the left hand side. Have fun!