Thursday, January 19, 2017

Least-cost modelling with Python using scikit-image

I've done a lot of least-cost modelling over the years, and I'm always interested to find software that enables this method to be applied in an open a way as possible - which for me means Python and R code! So I was excited to stumble across some functionality in the scikit-image package that has a graph module that does this. While the scikit-image package is concerned with image processing, GIS raster datasets are of an identical form to an image, as both can be represented as a 2-dimensional NumPy array. As the authors of the package explain in their paper that describes scikit-image, using NumPy arrays as the basis for scikit-image maximises compatibility with other Python packages (van der Walt et al. 2014), which would include most other geospatial Python packages. While scikit-image uses slightly different terminology, talking about "minimum cost paths" rather than "least-cost paths" as we tend to do in the geography/ecology world that I live in, on paper the approach looks identical to those commonly implemented in GIS software as it applies Dijkstra's algorithm with diagonal connectivity between cells. But before leaping into adopting this approach, I thought that while experimenting with scikit-image it would be wise to double-check that the results are the same as I would expect through more conventional GIS approaches, so I wrote some code:
# Import packages
import numpy as np
from skimage import graph
import nlmpy

# Set random seed so results are the same each time script runs

# Create a hypothetical cost-surface using the midpoint displacement method
cs = nlmpy.mpd(101, 101, h=1) * 5
# Make any values below one equal to no data, cs < 1, -9999)
# Specify the size of the cell
cellSize = 10

# From the cost-surface create a 'landscape graph' object which can then be
# analysed using least-cost modelling
lg = graph.MCP_Geometric(cs, sampling=(cellSize, cellSize))

# Choose a starting cell location
startCell = (5, 5)

# Calculate the least-cost distance from the start cell to all other cells
lcd = lg.find_costs(starts=[startCell])[0]

# Export the data for comparison in GIS software
startGrid = np.zeros((101,101)) - 9999
startGrid[5,5] = 1
nlmpy.exportASCIIGrid("start.asc", startGrid, cellSize = 10)
nlmpy.exportASCIIGrid("cost-surface.asc", cs, cellSize = 10), np.isinf(lcd), -9999) # insert no data values
nlmpy.exportASCIIGrid("least-cost-distances.asc", lcd, cellSize = 10)
I then took the output files created here, and processed them in ArcGIS using the cost distance tool. The results from both the scikit-image approach I coded and the ArcGIS results are shown in the picture at the top of the page. The key take home point is that they are, with the exception of some rounding issues beyond the third decimal point, identical. So I'm now confident in saying that the graph module of scikit-image does allow you to do least-cost modelling.

Of course it's one thing to verify the results, the other consideration here is computational speed. Faster is obviously better in general terms, but I'm particularly keen to have efficient least-cost modelling software as it enables better estimates of uncertainty to be calculated through iterative modelling. Recognising and quantifying the uncertainty associated with least-cost modelling is, I think, a very important area of research for the future - I discuss this issue in more detail in a recent review paper (Etherington 2016). So in addition to verify that the results are the same, I also ran some code to assess speed of processing:
# Import packages
import numpy as np
from skimage import graph
import time
import matplotlib.pyplot as plt

# Specify the cost-surface dimensions to process
dims = [100, 200, 400, 800, 1600, 3200, 6400]

# Create a list to hold results
times = []

# Process each cost-surface
for dim in dims:
    # Create uniform cost-surface
    cs = np.ones((dim, dim))    
    # Repeat 10 times and take mean time
    repeats = 10
    timeTaken = 0
    for i in range(repeats):
        # Start clock
        startTime = time.clock()    
        # Create landscape graph and calculate least-cost distance from corner
        lg = graph.MCP_Geometric(cs)
        lcd = lg.find_costs(starts=[(0,0)])[0]
        # End clock
        endTime = time.clock()
        # Calculate time taken and append to results
        timeTaken = timeTaken + (endTime - startTime)
    times.append(timeTaken / repeats)

# Plot results
plt.scatter(dims, times)
As you can see from the plotted results below, even on my laptop that only has 8GB of RAM, cost-surfaces with dimensions of over 6000 cells can be processed in less than a minute. From my experience with other GIS software, I think this speed is at least reasonably comparable, and certainly is fast enough to be used within my own research. Of course there is a power relationship between the dimensions and the speed, so there will eventually be an issue with processing cost-surfaces that are much bigger than those I have used here.