Postgres & PostGIS

Dissolving features by an attribute

Hi, I’m Sam. I’ve been learning a lot here at Linfiniti (thanks to the brilliant team!) Just like to add a quick note on one of the tasks I learnt this week.

I was working on a shapefile of the suburbs in Cape Town. A client required the suburbs to be grouped by region. After the tedious part of manually grouping the suburbs (using a created field, REGION), the unioning (dissolving) of the suburbs proved to be quick and painless through PostgreSQL, using this SQL command that implements the geomunion function:

create table ct_regions as select geomunion(the_geom), "REGION" \
 from "ct_suburbs" group by "REGION";

“ct_suburbs” is the original shapefile that was loaded into the PostGIS database using the Quantum GIS ‘SPIT’ plugin. “REGION” is the class (attribute) that I wish to union by.  And ct_regions will be the output shapefile. See the result here:

Before dissolving (ct_suburbs shows suburbs)

Cape Town suburbs before dissolving

After dissolving by suburbs (ct_regions shows collections of suburbs that have been merged)

Cape Town suburbs after dissolving

Hopefully this will be of some use when it comes to your own mapping!

Getting up and running with PostGIS in a jiffy

I’ve posted this before on my old blog so this is a repeat for those looking to get going with PostGIS in a hurry. This procedure should work on Ubuntu Jaunty or Ubuntu Karmic and possibly earlier versions.

sudo apt-get install postgresql-8.3-postgis
sudo su - postgres
createuser -s -d -r -l -P -E timlinux
createdb gis
createlang plpgsql gis
psql gis < /usr/share/postgresql-8.3-postgis/lwpostgis.sql
psql gis < /usr/share/postgresql-8.3-postgis/spatial_ref_sys.sql

Having done that, you should be able to connect to the database by creating a new postgis connection in QGIS. After that you can start to load data into your spatial database using SPIT (QGIS plugin to import shapefiles into QGIS) or the shp2psql command line tool.

Clipping data from Postgis

So you have moved your spatial data into PostGIS and everything is great….until you start wanting to clip it out again. Luckily clipping data within the postgres environment is trivial with the little help of SQL. However I wanted something a little more fancy – a way to iterate through all my spatial tables and clip out an arbitary polygon, saving the result to a shapefile. To achieve this I wrote a little bash script that does just that. The script will also optionally make sql dumps of the data and clean up after itself when it is done.

The SUFFIX option is used to append to the newly created shapefile / table / sqldump file.
The CLIPTABLE is the table containing polygons that should be used to intersect each table.
The CLIPFIELD is a field used to subset the cliptable e.g. to select a single polygon only.
The CLIPGEOMFIELD is the name of the geometry field in the CLIPTABLE.
The CLIPFIELDVALUE is the clipfield ‘where’ clause e.g. “Eastern Cape”
The DB is the database containing tables to be intersected.

If you set DELETETABLE to 0, it will create clip tables in the database with the suffix you specify.
If you set DELETEGEOM to 1 it will delete the original geometry, leaving only the intersected geometry.
If you set SHPDUMP to 1 it iwll create a shapefile for each intersected table.
If you set SQLDUMP to 1 it will make a sql dump file of each intersected table.

Note that for large dataset, clipping with an arbitary polygon can take a long time! Here is the script:

#!/bin/bash
SUFFIX="ecape"
CLIPTABLE="ecape"
CLIPFIELD="provname"
CLIPGEOMFIELD="the_geom"
CLIPFIELDVALUE="Eastern Cape"
DB="cdsm50k"
DELETETABLE=1
DELETEGEOM=1
SHPDUMP=1
SQLDUMP=0
pushd .
cd /tmp

for TABLE in `echo "\dt" | psql $DB \
    | awk '{print $3}' | grep "^[a-zA-Z]"`

do
  GEOMFIELD=`echo "\d $TABLE" | psql $DB | grep "geometry" \
  | grep -v "geometrytype" |awk '{print $1}'`
  if [ "$GEOMFIELD" == "" ]
  then
    echo "$TABLE has no geometry column, skipping"
  else
    echo "$TABLE -> $GEOMFIELD"
    echo "drop table \"${TABLE}_${SUFFIX}\";" | psql $DB
    # Note we use the && bounding box query first and
    # then the intersects query to avoid unneeded comparison of
    # complex geometries.
    SQL="CREATE TABLE \"${TABLE}_${SUFFIX}\" AS \
            SELECT \
            ST_Intersection(v.$GEOMFIELD, m.$CLIPGEOMFIELD) AS intersection_geom, \
            v.*, \
            m.$CLIPFIELD \
            FROM \
              \"$TABLE\" v, \
              $CLIPTABLE m \
            WHERE \
              ST_Intersects(v.$GEOMFIELD, m.$CLIPGEOMFIELD) AND \
              $CLIPFIELD='$CLIPFIELDVALUE';"

    echo $SQL  | psql $DB

    if [ $DELETEGEOM -eq 1 ]
    then
      echo "alter table \"${TABLE}_${SUFFIX}\" drop column $GEOMFIELD;" | psql $DB
    fi

    if [ $SHPDUMP -eq 1 ]
    then
      pgsql2shp -f ${TABLE}.shp -g "intersection_geom" $DB ${TABLE}_${SUFFIX}
    fi

    if [ $SQLDUMP -eq 1 ]
    then
      pg_dump -D $DB -t ${TABLE}_${SUFFIX} > ${TABLE}_${SUFFIX}.sql
    fi  

    if [ $DELETETABLE -eq 1 ]
    then
      echo "drop table ${TABLE}_${SUFFIX};" | psql $DB
    fi
  fi
done
echo "vacuum analyze;" | psql $DB
popd

EAV Database Modelling

Sometimes just because a thing is possible doesn’t make it a good idea. In the last week I have been helping a client whose web developer has dissapeared. The developer implemented the database using a variation on the Entity Attribute Value (EAV) modelling paradigm. The database consists of only four tables: Entities, Attributes, Values, EntityTypes. So the entities in the database don’t match any real-world entities – there is no products table for products, no categories table for categories and so on. This generic modelling approach might sound good but in practice it’s a real pain. Look at this query diagram representing a simple query to find out what categories exist in the database:

EAVDBModel

The query consists of numerous joins and is completely opaque. The EAV wikipedia article I referenced above contains a list of downsides to the EAV approach, with which I can only concur. For a new developer approach an EAV implementation its an uphill battle to understand what is going on. Writing queries is a cumbersome task, and the logic such as foreign key constraints that would normally reside in the database layer now needs to be implemented in the application codebase.

For me EAV is something I will be avoiding in any new developments I embark on…

Automatically dumping all Postgres tables into their own SQL files

Someone asked on twitter it is possible to dump all the tables in Postgres to individual shp files. Some time ago I wrote a script to dump all tables as SQL dumps. The question prompted me to tweak that script to drop out shapefiles instead.

My original script looked like the listing below. The dump files contain data only (see the comments in the bash script below) because I use this script to create fixtures for my django projects.

#!/bin/bash

# A script to create sql formatted fixtures (serialised models)
# used to initialise the application if you install it to another
# machine. You should run this any time you change your models
# or when you need to make a backup of all your data.

# Tim Sutton 2009
mkdir bees/sql
for TABLE in `echo "\d" | psql sabio | grep -v seq | awk '{print $3}'`
do
  echo $TABLE
  # -a data only
  # -t table
  # -D dump as sql inserts
  pg_dump -a -t $TABLE -D sabio > bees/sql/${TABLE}.sql
  #bzip2 bees/sql/${TABLE}.sql
done

To make the script drop out shapefiles I modified it a bit as shown in the next listing. Obviously as we are dumping shapefiles, we should only bother dumping tables with geometry in them so I went the route of using the geometry_columns table to decide which tables to dump…

#!/bin/bash

# A script to dump shapefiles of all tables listed in geometry_columns
# Tim Sutton 2009
mkdir bees/sql
for TABLE in `echo "select f_table_name from geometry_columns;" | psql sabio \
  | head -n -2 | egrep -v "\-\-\-\-\-\-\-\-\-" | egrep -v "f_table_name"`
do
  echo $TABLE
  pgsql2shp sabio $TABLE
done

Hope this is useful to someone out there :-)

Handy tip for the day : Backing up PostgreSQL data to a remote machine

Ok so I have a few production databases that I need to back up regularly. The trick is I want to run the backup from a remote machine so that the backup lives on a separate server to the actual database system. You can run backups manually like this (assuming your database is called ‘postgis’):

pg_dump -h dbhost -f postgis_`date +%d%B%Y`.sql.tar.gz -x -O -F tar postgis

When you run the above command, you will be prompted for a password. After entering the password you will find a date stamped backup. Very nice, but you may have noted that pg_dump has no option for giving it the password on the command line – it expects you to do that interactively. So what do we do if we need to automate the packup using a cron job? The solution is to use either ~/.pgpass or the PGPASSWORD environment variable. So here is how I automated the backup by placing a script in /etc/cron.daily/

export PGPASSWORD=secret
pg_dump -h dbhost -f postgis_`date +%d%B%Y`.sql.tar.gz -x -O -F tar postgis

PostGIS 1.4 Released

Good news if you are a PostGIS fan – the PostGIS team just released version 1.4! See the release announcement for full details…