## Thursday, January 19, 2017

### Least-cost modelling with Python using scikit-image

I've done a lot of least-cost modelling over the years, and I'm always interested to find software that enables this method to be applied in as open a way as possible - which for me means Python and R code! So I was excited to stumble across a graph module in the scikit-image package that does exactly this. While scikit-image is concerned with image processing, GIS raster datasets take an identical form to an image, as both can be represented as a 2-dimensional NumPy array. As the authors explain in the paper that describes scikit-image, using NumPy arrays as its basis maximises compatibility with other Python packages (van der Walt et al. 2014), which includes most other geospatial Python packages. While scikit-image uses slightly different terminology, talking about "minimum cost paths" rather than the "least-cost paths" we tend to talk about in the geography/ecology world I live in, on paper the approach looks identical to those commonly implemented in GIS software, as it applies Dijkstra's algorithm with diagonal connectivity between cells. But before leaping into adopting this approach, I thought that while experimenting with scikit-image it would be wise to double-check that the results are the same as I would expect from more conventional GIS approaches, so I wrote some code:
```
# Import packages
import numpy as np
from skimage import graph
import nlmpy

# Set random seed so results are the same each time script runs
np.random.seed(1)

# Create a hypothetical cost-surface using the midpoint displacement method
cs = nlmpy.mpd(101, 101, h=1) * 5
# Make any values below one equal to no data
np.place(cs, cs < 1, -9999)
# Specify the size of the cell
cellSize = 10

# From the cost-surface create a 'landscape graph' object which can then be
# analysed using least-cost modelling
lg = graph.MCP_Geometric(cs, sampling=(cellSize, cellSize))

# Choose a starting cell location
startCell = (5, 5)

# Calculate the least-cost distance from the start cell to all other cells
# (find_costs returns the cumulative cost array and a traceback array)
lcd = lg.find_costs(starts=[startCell])[0]

# Export the data for comparison in GIS software
startGrid = np.zeros((101,101)) - 9999
startGrid[5,5] = 1
nlmpy.exportASCIIGrid("start.asc", startGrid, cellSize = 10)
nlmpy.exportASCIIGrid("cost-surface.asc", cs, cellSize = 10)
np.place(lcd, np.isinf(lcd), -9999) # insert no data values
nlmpy.exportASCIIGrid("least-cost-distances.asc", lcd, cellSize = 10)
```
I then took the output files created here, and processed them in ArcGIS using the cost distance tool. The results from both the scikit-image approach I coded and the ArcGIS results are shown in the picture at the top of the page. The key take home point is that they are, with the exception of some rounding issues beyond the third decimal point, identical. So I'm now confident in saying that the graph module of scikit-image does allow you to do least-cost modelling.
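For intuition about what is happening under the hood, Dijkstra's algorithm on a grid with diagonal connectivity can be sketched in pure Python. This is my own toy illustration, not scikit-image's implementation: the cost of each move is taken as the distance between cell centres multiplied by the mean of the two cells' cost values, which mirrors how MCP_Geometric weights moves.

```
import heapq
import math

def grid_dijkstra(cost, start, cell_size=1.0):
    """Accumulated least-cost distance from start to every cell."""
    n_row, n_col = len(cost), len(cost[0])
    dist = [[math.inf] * n_col for _ in range(n_row)]
    dist[start[0]][start[1]] = 0.0
    heap = [(0.0, start)]
    # The eight orthogonal and diagonal neighbour offsets
    moves = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if (dr, dc) != (0, 0)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry, a shorter route was already found
        for dr, dc in moves:
            nr, nc = r + dr, c + dc
            if 0 <= nr < n_row and 0 <= nc < n_col:
                # Distance travelled times the mean of the two cell costs
                step = cell_size * math.hypot(dr, dc)
                nd = d + step * (cost[r][c] + cost[nr][nc]) / 2
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist
```

On a uniform cost-surface of ones the least-cost distance to a diagonal neighbour is simply the square root of two times the cell size, which is a handy sanity check.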

Of course it's one thing to verify the results; the other consideration here is computational speed. Faster is obviously better in general terms, but I'm particularly keen to have efficient least-cost modelling software as it enables better estimates of uncertainty to be calculated through iterative modelling. Recognising and quantifying the uncertainty associated with least-cost modelling is, I think, a very important area of research for the future - I discuss this issue in more detail in a recent review paper (Etherington 2016). So in addition to verifying that the results are the same, I also ran some code to assess the speed of processing:
```
# Import packages
import numpy as np
from skimage import graph
import time
import matplotlib.pyplot as plt

# Specify the cost-surface dimensions to process
dims = [100, 200, 400, 800, 1600, 3200, 6400]

# Create a list to hold results
times = []

# Process each cost-surface
for dim in dims:
    print(dim)
    # Create uniform cost-surface
    cs = np.ones((dim, dim))
    # Repeat 10 times and take mean time
    repeats = 10
    timeTaken = 0
    for i in range(repeats):
        print(i)
        # Start clock
        startTime = time.perf_counter()
        # Create landscape graph and calculate least-cost distance from corner
        lg = graph.MCP_Geometric(cs)
        lcd = lg.find_costs(starts=[(0, 0)])
        # End clock
        endTime = time.perf_counter()
        # Calculate time taken
        timeTaken = timeTaken + (endTime - startTime)
    # Append the mean time to the results
    times.append(timeTaken / repeats)

# Plot results
plt.scatter(dims, times)
plt.show()
```
As you can see from the plotted results below, even on my laptop with only 8GB of RAM, cost-surfaces with dimensions of over 6000 cells can be processed in less than a minute. From my experience with other GIS software, I think this speed is at least reasonably comparable, and certainly fast enough to be used within my own research. Of course there is a power relationship between the dimensions and the processing time, so there will eventually be an issue with processing cost-surfaces that are much bigger than those I have used here.
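That power relationship can be checked by fitting a straight line to the timings in log-log space. The timing values below are made up purely for illustration; real timings from the script above would be used in practice.

```
import numpy as np

# Hypothetical dimensions and mean run times (seconds), for illustration only
dims = np.array([100, 200, 400, 800, 1600, 3200, 6400])
times = 2e-6 * dims.astype(float) ** 2.1  # made-up power-law data

# Fit time = a * dim^b, which is linear in log-log space
b, log_a = np.polyfit(np.log(dims), np.log(times), 1)
print(f"time ~ {np.exp(log_a):.2e} * dim^{b:.2f}")
```

The fitted exponent `b` then tells you how quickly run times will blow out as cost-surface dimensions grow.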


## Wednesday, May 04, 2016

### When to use a lognormal distribution rather than a normal distribution

Recently a reviewer commented that perhaps a lognormal distribution rather than a normal distribution should be used in order to avoid truncation of negative values. Now I had used lognormal distributions in the past, but only so that I could describe a positively skewed distribution.

So I did some reading and found a paper by Limpert et al. (2001) from which I picked out a few key points:

1. The reviewer was quite correct: unlike a normal distribution, a lognormal distribution is incapable of producing random numbers below zero.
2. As the mean gets larger and the standard deviation gets smaller, the normal and lognormal distributions become increasingly similar.
3. The two distributions are defined in a similar manner. The normal distribution is defined by the arithmetic mean and standard deviation, and the lognormal distribution is defined by the geometric mean and (multiplicative) geometric standard deviation.
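The first two points are easy to check numerically. Here is a quick sketch of my own using NumPy (the parameter values are arbitrary, chosen only to illustrate the behaviour):

```
import numpy as np

rng = np.random.default_rng(1)

# Point 1: lognormal samples are strictly positive
samples = rng.lognormal(mean=0.5, sigma=0.75, size=10_000)
print(samples.min() > 0)

# Point 2: with a large mean and a small spread on the log scale,
# the lognormal becomes nearly symmetric, like a normal distribution,
# so its mean and median almost coincide
narrow = rng.lognormal(mean=np.log(100), sigma=0.01, size=10_000)
print(abs(np.mean(narrow) - np.median(narrow)) / np.mean(narrow))
```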

To get my head around how the normal and lognormal distributions behaved, and how to generate and fit data in R, I knocked up some example scripts. I didn't fix the random seed value as I wanted to see how the patterns varied with different random samples, so if you run the scripts yourself you will get different results to those shown here, but the patterns will be broadly the same.

```
# Define some values to work with
v = c(9.36, 8.06, 7.93)

# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example1.png", width=15/2.54, height=12/2.54,
res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))

# NORMAL DISTRIBUTION
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Generate some random numbers
normal = rnorm(1000, mean = aM, sd = aS)
# Plot a histogram
hist(normal, breaks=seq(-5,50,0.5), xlim=c(-5,20),
col="dodgerblue", border="white",
main="Normal", xlab="Values")
# Create a probability density values and add to plot
normY = dnorm(x, mean = aM, sd = aS)
par(new=TRUE)
plot(x, normY, type = "l", xlim=c(-5,20), col="red", bty="n",
xaxt="n", yaxt="n", ylab="", xlab="")

# LOGNORMAL DISTRIBUTION
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20),
col="gold", border="white",
main="Lognormal", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, meanlog = log(gM), sdlog = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
xaxt="n", yaxt="n", ylab="", xlab="")

dev.off()
```

In this example we can see that the two distributions describe the data in an almost identical manner, so either could be used. However, when the values are closer to zero, and the variation becomes larger, there is a marked difference in behaviour.

```
# Define some values to work with
v = c(2.49, 0.93, 4.08)

# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example2.png", width=15/2.54, height=12/2.54,
res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))

# NORMAL DISTRIBUTION
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Generate some random numbers
normal = rnorm(1000, mean = aM, sd = aS)
# Plot a histogram
hist(normal, breaks=seq(-5,50,0.5), xlim=c(-5,20),
col="dodgerblue", border="white",
main="Normal", xlab="Values")
# Create a probability density values and add to plot
normY = dnorm(x, mean = aM, sd = aS)
par(new=TRUE)
plot(x, normY, type = "l", xlim=c(-5,20), col="red", bty="n",
xaxt="n", yaxt="n", ylab="", xlab="")

# LOGNORMAL DISTRIBUTION
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20),
col="gold", border="white",
main="Lognormal", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, meanlog = log(gM), sdlog = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
xaxt="n", yaxt="n", ylab="", xlab="")

dev.off()
```

Now the randomly generated numbers from the normal distribution contain values that are negative, while in contrast the lognormal distribution produces values that are all above zero. This is quite a handy property for modelling, as there are occasions where data such as species abundances or total rainfall cannot be negative. If a normal distribution were used there would be the potential for illogical negative values to be randomly generated – which could cause merry chaos within a model!

So the take home message would seem to be that if your data should always be positive, then use a lognormal distribution rather than a normal distribution. This will prevent any chance of negative values being generated, and as the mean becomes larger and the standard deviation becomes smaller the lognormal distribution becomes identical to a normal distribution anyway.

Of course that is all well and good when you are modelling from your own data. But what if you want to use other published information that has been described in terms of an arithmetic mean and standard deviation rather than the geometric mean and standard deviation that you need for a lognormal distribution? Well, you can also estimate the geometric mean and standard deviation from the arithmetic mean and standard deviation. To demonstrate, the following compares a lognormal distribution generated from some actual data with one generated from a reported arithmetic mean and standard deviation.
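The estimation formulas can equally be written in Python. This is a sketch of my own using the same moment-matching formulas as the R code that follows; the function and variable names are mine.

```
import math

def geometric_from_arithmetic(a_mean, a_sd):
    """Estimate the geometric mean and multiplicative standard deviation
    of a lognormal distribution from its arithmetic mean and standard
    deviation (moment matching)."""
    cv2 = (a_sd / a_mean) ** 2  # squared coefficient of variation
    g_mean = a_mean / math.sqrt(1 + cv2)
    g_sd = math.exp(math.sqrt(math.log(1 + cv2)))
    return g_mean, g_sd
```

A nice property of these formulas is that they are exact for a true lognormal distribution: feeding in the arithmetic moments of a lognormal recovers its geometric mean and multiplicative standard deviation precisely.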

```
# Define some values to work with
v = c(2.49, 0.93, 4.08)

# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example3.png", width=15/2.54, height=12/2.54,
res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))

# LOGNORMAL DISTRIBUTION FROM CALCULATED VALUES
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20),
col="dodgerblue", border="white",
main="Lognormal calculated", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, meanlog = log(gM), sdlog = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
xaxt="n", yaxt="n", ylab="", xlab="")

# LOGNORMAL DISTRIBUTION FROM ESTIMATED VALUES
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Estimate the geometric mean and standard deviation
gEstM = aM / sqrt(1 + (aS / aM) ^ 2)
gEstS = exp(sqrt(log(1 + (aS / aM) ^ 2)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gEstM), sdlog = log(gEstS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20),
col="gold", border="white",
main="Lognormal estimated", xlab="Values")
# Create a probability density values and add to plot
lognormY = dlnorm(x, meanlog = log(gEstM), sdlog = log(gEstS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
xaxt="n", yaxt="n", ylab="", xlab="")

dev.off()
```

As can be seen, the two distributions are very similar. So while it would obviously be better to get the original data and base a lognormal distribution on a calculated geometric mean and standard deviation, if all that is available is a reported arithmetic mean and standard deviation, it should still be possible to produce a reasonable lognormal distribution for inclusion within a model.


## Thursday, April 21, 2016

### Archiving raster data: choosing a raster and compression file format

I'm going to be generating a large number of raster datasets in an upcoming project, and I want to archive the data such that it is as openly accessible as possible. As well as choosing a place to store the data (a future post on this, methinks) I need to decide on the file formats that I use. There seem to be two file format considerations here: the raster file format, and the compression file format.

I have in the past been unable to make use of shared data because it was stored in an obsolete software file format. Therefore, I have developed a preference for simple plain text file formats for data archiving. Plain text files may not be the most efficient format for actually analysing the data, but when archived as plain text, the data can always be accessed and converted into something more suitable for an individual workflow.

The Esri ASCII raster (.asc) format is a non-proprietary plain text way of storing raster data. I've been using .asc files for years as they are supported by lots of GIS programs such as ArcGIS, QGIS, and IDRISI, and can be easily imported into Python and R. A downside is that .asc files do not contain coordinate system information, so .asc files do need to be accompanied by supporting information that describes the coordinate system used.
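The .asc format is simple enough to write by hand: a six-line header followed by the cell values. A minimal sketch of my own (file name and values are purely illustrative):

```
import numpy as np

def write_ascii_grid(path, array, cell_size, nodata=-9999):
    """Write a 2D NumPy array as an Esri ASCII raster (.asc)."""
    n_row, n_col = array.shape
    header = (
        f"ncols {n_col}\n"
        f"nrows {n_row}\n"
        "xllcorner 0\n"
        "yllcorner 0\n"
        f"cellsize {cell_size}\n"
        f"NODATA_value {nodata}\n"
    )
    with open(path, "w") as f:
        f.write(header)
        # One line of space-separated values per raster row
        for row in array:
            f.write(" ".join(str(v) for v in row) + "\n")

write_ascii_grid("example.asc", np.arange(6).reshape(2, 3), cell_size=10)
```

Reading the values back is just as easy: `np.loadtxt("example.asc", skiprows=6)` skips the header and returns the array.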

As I'm concerned about archiving a large number of raster datasets, file compression is also going to be important. The archived files need to be compressed as much as possible so that requirements for both storage space and file transfers and downloads can be reduced. Having looked around on the web for advice, there is a wide variety of views as to the best file compression format, so I explored the efficiency of .asc file compression for some of the more popular approaches. As my concern is the archiving of open data, I did not consider factors such as compression speed or encryption; all I was concerned about was the compression ratio. My results would suggest that for an .asc file, the .7z compression format provides the best compression ratio. Use of the .7z format also has other advantages for archiving open data. The .7z format is openly documented and supported by a wide range of compression software, and the freely available 7-Zip software can be used on any major computer operating system. This means that potential users of the data should always be able to easily decompress the data once they have received it.
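While the .7z container format itself is not in the Python standard library, the LZMA algorithm that 7-Zip uses by default is, so rough compression-ratio comparisons can be sketched directly in Python (the sample data below is a crude stand-in for an .asc file, chosen only for illustration):

```
import bz2
import gzip
import lzma

# Repetitive plain-text numbers, mimicking the structure of an .asc file
data = ("ncols 200\nnrows 100\n"
        + (" ".join(["-9999"] * 100 + ["1.5"] * 100) + "\n") * 100).encode()

# Compare compression ratios (original size / compressed size)
for name, compress in [("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    ratio = len(data) / len(compress(data))
    print(f"{name}: {ratio:.1f}x")
```

Real .asc files will of course compress differently depending on how repetitive the cell values are, which is why it was worth testing on actual data.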

## Thursday, April 23, 2015

### Get the maximum and minimum values from multiple 2-dimensional NumPy arrays

A classic GIS question you might ask is: "What is the maximum or minimum cell value from a set of multiple overlapping rasters?". When your raster data is in the form of two 2-dimensional NumPy arrays, this is very straightforward as you can just use the numpy.maximum or numpy.minimum functions to get a new array with the maximum or minimum values. However, if you have more than two arrays you need another solution.

To start, let's create the arrays shown in the picture at the top of this post:

```
import numpy as np
# Create some arrays to stack
a = np.array([[0,1,2],[3,4,5],[6,7,8],[4,6,7]])
b = np.array([[5,4,8],[6,3,9],[1,5,9],[0,0,0]])
c = np.array([[5,1,9],[5,9,7],[3,1,5],[3,9,1]])
```

The next step is to use the numpy.dstack function to stack the arrays depth wise - which in a GIS context is to say on top of each other:

```
# Stack the arrays
abc = np.dstack((a,b,c))
```

It is at this point that things get a little more complicated. What we can do is use the numpy.argmax or numpy.argmin functions to pull out not the minimum and maximum values themselves, but rather an index to the location of the minimum and maximum values within the array stack. The axis argument is set to 2 so that the index is for minimum and maximum values going depth wise into the stack (the first two axes would relate to looking at the stack in terms of top to bottom or left to right):

```
# Get the index values for the minimum and maximum values
maxIndex = np.argmax(abc, axis=2)
minIndex = np.argmin(abc, axis=2)
```

Having got the index values for the minimum and maximum values in terms of their depth into the array stack, we just need to create two other arrays that can act as the matching row and column index positions:

```
# Create column and row position arrays
nRow, nCol = np.shape(a)
col, row = np.meshgrid(range(nCol), range(nRow))
```

And then it's a simple process of indexing out the maximum and minimum values within the array stack to create new arrays containing the minimum and maximum values as shown in the picture at the top of this post:

```
# Index out the maximum and minimum values from the stacked array based on row
# and column position and the maximum value
maxValue = abc[row, col, maxIndex]
minValue = abc[row, col, minIndex]
```

Putting the whole process together we get:

```
import numpy as np

# Create some arrays to stack
a = np.array([[0,1,2],[3,4,5],[6,7,8],[4,6,7]])
b = np.array([[5,4,8],[6,3,9],[1,5,9],[0,0,0]])
c = np.array([[5,1,9],[5,9,7],[3,1,5],[3,9,1]])
# Stack the arrays
abc = np.dstack((a,b,c))

# Get the index values for the minimum and maximum values
maxIndex = np.argmax(abc, axis=2)
minIndex = np.argmin(abc, axis=2)

# Create column and row position arrays
nRow, nCol = np.shape(a)
col, row = np.meshgrid(range(nCol), range(nRow))

# Index out the maximum and minimum values from the stacked array based on row
# and column position and the maximum value
maxValue = abc[row, col, maxIndex]
minValue = abc[row, col, minIndex]
```
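Worth noting: if only the values themselves are needed, NumPy can produce them directly with numpy.max and numpy.min along the stacking axis. The argmax/argmin route above earns its keep when you also want to know which raster each value came from, since the index arrays record exactly that.

```
import numpy as np

# Small illustrative arrays of my own
a = np.array([[0, 1], [3, 4]])
b = np.array([[5, 4], [2, 3]])
c = np.array([[1, 9], [5, 0]])
abc = np.dstack((a, b, c))

# Direct cell-wise maximum and minimum across the stack
maxValue = abc.max(axis=2)
minValue = abc.min(axis=2)

# Which layer (0=a, 1=b, 2=c) supplied each maximum
maxLayer = np.argmax(abc, axis=2)
```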

## Saturday, April 18, 2015

### ArcGIS integrates with the Python SciPy software stack

I was excited to read in the latest ArcUser Magazine that the Python SciPy stack will be integrated from version 10.3 of ArcGIS Pro and version 10.3.1 of ArcGIS Desktop. I think this is great news for making the power of Python for scientific computing more accessible to users of ArcGIS.

I initially started Python programming with the arcgisscripting module of version 9.2 of ArcGIS, and at first did research using only this approach (Etherington 2011). However, from those early beginnings I quickly learnt that Python has massive potential for scientific analysis and visualisation, and I've been making use of the NumPy, SciPy, and Matplotlib packages for many years. As I discovered more and more about what these packages could do I even found that I could take a purely Python approach for some of my research (Etherington et al. 2015).

So I would urge users of ArcGIS who haven’t programmed with these newly available Python packages to seriously consider spending some time learning how to make use of them.

Note to self - must make time to learn how to make better use of Pandas...
