Wednesday, May 04, 2016

When to use a lognormal distribution rather than a normal distribution

Recently a reviewer commented that perhaps a lognormal distribution, rather than a normal distribution, should be used in order to avoid having to truncate negative values. Now I had used lognormal distributions in the past, but only so that I could describe a positively skewed distribution.

So I did some reading and found a paper by Limpert et al. (2001) from which I picked out a few key points:

  1. The reviewer was quite correct, and unlike a normal distribution, a lognormal distribution is incapable of producing random numbers below zero.
  2. As the mean gets larger and the standard deviation gets smaller, the normal and lognormal distributions become practically identical (see the quick check after this list).
  3. The two distributions are defined in a similar manner. The normal distribution is defined by the arithmetic mean and standard deviation, and the lognormal distribution is defined by the geometric mean and standard deviation.
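
On point 2, here is a quick check of my own (not from the paper). The numbers are arbitrary, chosen only so that the mean is large relative to the standard deviation, and the conversion to the lognormal meanlog and sdlog parameters uses the moment-matching identities covered later in this post.

# Quick check: with a large mean and a small standard deviation the two
# probability density curves are nearly indistinguishable
m = 100                             # arithmetic mean (arbitrary value)
s = 2                               # arithmetic standard deviation (arbitrary value)
x = seq(90, 110, 0.01)
sdlog = sqrt(log(1 + (s / m) ^ 2))  # matching lognormal parameters
meanlog = log(m) - sdlog ^ 2 / 2
# Maximum difference between the two density curves - very small
max(abs(dnorm(x, mean = m, sd = s) - dlnorm(x, meanlog = meanlog, sdlog = sdlog)))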

To get my head around how the normal and lognormal distributions behaved, and how to generate and fit data in R, I knocked up some example scripts. I didn't fix the random seed value, as I wanted to see how the patterns varied with different random samples, so if you run the scripts yourself you will get different results from those shown here, but the patterns will be broadly the same.
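
If you do want reproducible results, fixing the seed before sampling is all that is needed (the seed value here is arbitrary):

# Fix the random seed so that repeated runs give identical random numbers
set.seed(42)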

# Define some values to work with
v = c(9.36, 8.06, 7.93)
 
# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example1.png", width=15/2.54, height=12/2.54, 
    res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))
 
# NORMAL DISTRIBUTION
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Generate some random numbers
normal = rnorm(1000, mean = aM, sd = aS)
# Plot a histogram
hist(normal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="dodgerblue", border="white",
     main="Normal", xlab="Values")
# Calculate probability density values and add to the plot
normY = dnorm(x, mean = aM, sd = aS)
par(new=TRUE)
plot(x, normY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
# LOGNORMAL DISTRIBUTION
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="gold", border="white",
     main="Lognormal", xlab="Values")
# Calculate probability density values and add to the plot
lognormY = dlnorm(x, meanlog = log(gM), sdlog = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
dev.off()

In this example we can see that the two distributions describe the data in an almost identical manner, so either could be used. However, when the values are closer to zero and the variation is larger, there is a marked difference in behaviour.

# Define some values to work with
v = c(2.49, 0.93, 4.08)
 
# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example2.png", width=15/2.54, height=12/2.54, 
    res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))
 
# NORMAL DISTRIBUTION
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Generate some random numbers
normal = rnorm(1000, mean = aM, sd = aS)
# Plot a histogram
hist(normal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="dodgerblue", border="white",
     main="Normal", xlab="Values")
# Calculate probability density values and add to the plot
normY = dnorm(x, mean = aM, sd = aS)
par(new=TRUE)
plot(x, normY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
# LOGNORMAL DISTRIBUTION
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="gold", border="white",
     main="Lognormal", xlab="Values")
# Calculate probability density values and add to the plot
lognormY = dlnorm(x, meanlog = log(gM), sdlog = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
dev.off()

Now the randomly generated numbers from the normal distribution contain negative values, while in contrast the lognormal distribution produces values that are all above zero. This is quite a handy property for modelling, as there are occasions where data such as species abundances or total rainfall cannot be negative. If a normal distribution were used, there would be the potential for illogical negative values to be randomly generated, which could cause merry chaos within a model!
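
The difference can also be checked directly from the distribution functions, without any random sampling. Using the second set of example values, a normal distribution is expected to place roughly 6% of its values below zero, while the lognormal places none there:

# Expected proportion of values below zero for each distribution
v = c(2.49, 0.93, 4.08)
pnorm(0, mean = mean(v), sd = sd(v))                   # normal: about 0.056
plnorm(0, meanlog = mean(log(v)), sdlog = sd(log(v)))  # lognormal: exactly 0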

So the take home message would seem to be that if your data should always be positive, then use a lognormal distribution rather than a normal distribution. This will prevent any chance of negative values being generated, and as the mean becomes larger and the standard deviation becomes smaller the lognormal distribution becomes identical to a normal distribution anyway.

Of course, that is all well and good when you are modelling from your own data. But what if you want to use other published information that has been described in terms of an arithmetic mean and standard deviation, rather than the geometric mean and standard deviation that you need for a lognormal distribution? Well, you can also estimate the geometric mean and standard deviation from the arithmetic ones. To demonstrate, the following compares a lognormal distribution generated from some actual data with one generated from a reported arithmetic mean and standard deviation.
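
For reference, the estimates used in the script below come from the standard moment-matching identities for the lognormal distribution. The little wrapper function is just my own convenience, not from any package:

# For a lognormal distribution with parameters meanlog and sdlog:
#   arithmetic mean:     aM   = exp(meanlog + sdlog^2 / 2)
#   arithmetic variance: aS^2 = (exp(sdlog^2) - 1) * exp(2 * meanlog + sdlog^2)
# Rearranging these gives the geometric mean and standard deviation
arith2geom = function(aM, aS) {
  sdlog = sqrt(log(1 + (aS / aM) ^ 2))
  meanlog = log(aM) - sdlog ^ 2 / 2
  c(gM = exp(meanlog), gS = exp(sdlog))
}
# e.g. arith2geom(mean(v), sd(v))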

# Define some values to work with
v = c(2.49, 0.93, 4.08)
 
# Set up things for plotting
x = seq(-5, 20, 0.1)
png("example3.png", width=15/2.54, height=12/2.54, 
    res=100, units="in", pointsize=8)
par(oma=c(2,0,0,0), mar=c(3,3,2,0.5), mgp=c(2,0.55,0), mfrow=c(2,1))
 
# LOGNORMAL DISTRIBUTION FROM CALCULATED VALUES
# Calculate geometric mean and standard deviation
gM = exp(mean(log(v)))
gS = exp(sd(log(v)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gM), sdlog = log(gS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="dodgerblue", border="white",
     main="Lognormal calculated", xlab="Values")
# Calculate probability density values and add to the plot
lognormY = dlnorm(x, meanlog = log(gM), sdlog = log(gS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
# LOGNORMAL DISTRIBUTION FROM ESTIMATED VALUES
# Calculate arithmetic mean and standard deviation
aM = mean(v)
aS = sd(v)
# Estimate the geometric mean and standard deviation
gEstM = aM / sqrt(1 + (aS / aM) ^ 2)
gEstS = exp(sqrt(log(1 + (aS / aM) ^ 2)))
# Generate some random numbers
lognormal = rlnorm(1000, meanlog = log(gEstM), sdlog = log(gEstS))
# Plot a histogram
hist(lognormal, breaks=seq(-5,50,0.5), xlim=c(-5,20), 
     col="gold", border="white",
     main="Lognormal estimated", xlab="Values")
# Calculate probability density values and add to the plot
lognormY = dlnorm(x, meanlog = log(gEstM), sdlog = log(gEstS))
par(new=TRUE)
plot(x, lognormY, type = "l", xlim=c(-5,20), col="red", bty="n",
     xaxt="n", yaxt="n", ylab="", xlab="")
 
dev.off()

As can be seen, the two distributions are very similar. So while it would obviously be better to get the original data and base a lognormal distribution on a calculated geometric mean and standard deviation, if all that is available is a reported arithmetic mean and standard deviation, it should still be possible to produce a reasonable lognormal distribution for inclusion within a model.

References

Limpert E, Stahel WA, Abbt M (2001) Log-normal distributions across the sciences: keys and clues. BioScience 51: 341-352.

Thursday, April 21, 2016

Archiving raster data: choosing a raster and compression file format

I'm going to be generating a large number of raster datasets in an upcoming project, and I want to archive the data such that it is as openly accessible as possible. As well as choosing a place to store the data (a future post on this, methinks), I need to decide on the file formats to use. There seem to be two considerations here: the raster file format and the compression file format.

In the past I have been unable to make use of shared data because it was stored in an obsolete, software-specific file format. As a result, I have developed a preference for simple plain-text file formats for data archiving. Plain text may not be the most efficient format for actually analysing the data, but data archived as plain text can always be accessed and converted into something more suited to an individual workflow.

The Esri ASCII raster (.asc) format is a non-proprietary, plain-text way of storing raster data. I've been using .asc files for years, as they are supported by many GIS programs, such as ArcGIS, QGIS, and IDRISI, and can be easily imported into Python and R. A downside is that .asc files do not contain coordinate system information, so they need to be accompanied by supporting information that describes the coordinate system used.
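
For anyone who hasn't come across the format, a minimal .asc file looks something like the following (the values are invented purely for illustration): a six-line header followed by the grid values in row order, with the NODATA_value standing in for missing cells.

ncols        4
nrows        3
xllcorner    0.0
yllcorner    0.0
cellsize     50.0
NODATA_value -9999
1 2 3 4
5 6 -9999 8
9 10 11 12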

As I'm concerned with archiving a large number of raster datasets, file compression is also going to be important. The archived files need to be compressed as much as possible so that the requirements for both storage space and file transfer can be reduced. Having looked around on the web for advice, there is a wide variety of views as to which compression format is best, so I explored the .asc file compression efficiency of what seemed to be some of the more popular approaches. As my concern is the archiving of open data, I did not consider factors such as compression speed or encryption; all I was concerned about was the compression ratio.
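
The comparison itself is easy to script. The sketch below shows the general approach, assuming the gzip, zip, and 7z command-line tools are installed and on the path; "example.asc" is a placeholder for a real raster file.

# Compress an example .asc file with several tools and compare sizes
f = "example.asc"
system(paste("gzip -k", f))          # produces example.asc.gz (-k keeps the original)
system(paste("zip example.zip", f))  # produces example.zip
system(paste("7z a example.7z", f))  # produces example.7z
sizes = file.size(c(f, "example.asc.gz", "example.zip", "example.7z"))
names(sizes) = c("original", "gzip", "zip", "7z")
# Compression ratio: compressed size as a proportion of the original size
round(sizes / sizes["original"], 3)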

My results would suggest that, for an .asc file, the .7z compression format provides the best compression ratio. Use of the .7z format also has other advantages for archiving open data. The format is openly documented, which means that it is supported by a wide range of compression software, and the free, open-source 7-Zip software (and its ports) can be used on all of the major operating systems. This means that potential users of the data should always be able to decompress it easily once they have received it.