Matthew Aiello-Lammens Ecologist at Work

Examining whether the order of scaling and log-transformation matters (archive)

The following question has come up as I continue to explore datasets related to my current PostDoc work. Given a dataset that requires log-transformation in order to fit a normal distribution, does it matter if I log-transform then scale the data versus scale then log-transform?
First, let's get some data that could be considered as needing log-transformation to meet the assumptions of normality. I downloaded the dataset of plant specific leaf area (SLA) from Reich 1999, as used and cited in Logan 2012, from the website associated with Logan 2012 here.
## Require packages
require(ggplot2)
## Loading required package: ggplot2
require(reshape2)
## Loading required package: reshape2

## Read in the data
LeafArea <- read.csv("~/Google Drive/Professional/Short-R-Examples/reich.csv")

## Quick peek at the data
head(LeafArea)
##   LOCATION FUNCTION LEAFAREA
## 1 Newmex Shrub 105.0
## 2 Newmex Tree 124.0
## 3 Newmex Tree 83.8
## 4 Newmex Shrub 39.7
## 5 Newmex Shrub 51.2
## 6 Newmex Shrub 66.0

## Make a histogram of the LeafArea
qplot(LEAFAREA, data = LeafArea, geom = "histogram", binwidth = 10)
[Plot: histogram of LEAFAREA, binwidth = 10]

## Now look at this same data, log10 transformed
qplot(log10(LEAFAREA), data = LeafArea, geom = "histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
## Warning: position_stack requires constant width: output may be incorrect
[Plot: histogram of log10(LEAFAREA)]
OK, now let's see how things look when I scale, then log10 transform, versus log10 transform, then scale.
## First scale, then log transform
LeafArea$ScaleLog <- log10(scale(LeafArea$LEAFAREA))
## Warning: NaNs produced

## Next log, then scale
LeafArea$LogScale <- scale(log10(LeafArea$LEAFAREA))

## Now plot these two
LeafArea_m <- melt(data = LeafArea, id.vars = c(1:3))
p <- ggplot(LeafArea_m, aes(x = value, colour = variable)) + geom_density()
p
## Warning: Removed 33 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_density).
[Plot: density curves of ScaleLog and LogScale]
Definitely different in the density plots. What about histograms?
h <- ggplot(LeafArea_m, aes(x = value, fill = variable)) + geom_histogram(position = "identity", 
alpha = 0.4)
h
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
[Plot: overlaid histograms of ScaleLog and LogScale]
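It may help to recall what scale() does with its default arguments: it subtracts the mean and divides by the sample standard deviation, so every value below the mean comes out negative. The sketch below illustrates this on a toy vector; the vector and variable names are made up for illustration and are not taken from the SLA dataset.

```r
# Hypothetical toy vector, not taken from the SLA dataset
x <- c(2, 4, 8, 16, 32)

# scale() with its defaults: subtract the mean, divide by the sample sd.
# It returns a one-column matrix, so convert it to a plain numeric vector.
scaled_builtin <- as.numeric(scale(x))

# The same calculation written out by hand
scaled_manual <- (x - mean(x)) / sd(x)

# The two versions agree up to floating-point tolerance
same_result <- isTRUE(all.equal(scaled_builtin, scaled_manual))

# Every value below the mean is negative after scaling
below_mean <- x < mean(x)
negative_after_scaling <- scaled_manual < 0
```

Here below_mean and negative_after_scaling flag the same elements, which is why taking a log after scaling fails for the lower part of the data.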

Let's try this one more time with simulated data.
# Generate some random data that needs a log transform
Sample_Data <- exp(rnorm(n = 500, mean = 0, sd = 1))
# Plot the data before transform
qplot(x = Sample_Data, geom = "histogram", binwidth = 1)
[Plot: histogram of Sample_Data, binwidth = 1]
# Plot the data after transform
qplot(x = log(Sample_Data), geom = "histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
[Plot: histogram of log(Sample_Data)]

# Now compare Scaling then log transform vs Log transform then scaling
Sample_Data_Test <- data.frame(ScaleLog = log(scale(Sample_Data)), LogScale = scale(log(Sample_Data)))
## Warning: NaNs produced

h <- ggplot(melt(Sample_Data_Test), aes(x = value, fill = variable)) + geom_histogram(position = "identity",
alpha = 0.4)
## Using as id variables
h
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
[Plot: overlaid histograms of ScaleLog and LogScale for the simulated data]
I'm not sure what to make of this. It's clear that the order of scaling and log-transforming matters. However, I'm not sure which order makes more sense. It certainly seems that log then scale produces a nicely centered distribution, though the result seems a bit leptokurtic.
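One check that may help with the shape question: scaling is a linear transformation, so it should change only the location and spread of the log-transformed data, not its kurtosis. The sketch below compares the two on freshly simulated data. The seed and the kurtosis() helper are assumptions added for illustration (base R has no built-in kurtosis function), and the helper uses the simple moment-based definition, which equals 3 for a normal distribution.

```r
# Arbitrary seed, used only so the sketch is reproducible
set.seed(1)
x <- exp(rnorm(n = 500, mean = 0, sd = 1))

log_x <- log(x)
log_scale <- as.numeric(scale(log_x))

# Hypothetical helper: moment-based kurtosis (fourth central moment
# divided by the squared second central moment)
kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

kurt_log <- kurtosis(log_x)
kurt_log_scale <- kurtosis(log_scale)

# Scaling leaves the kurtosis unchanged, up to floating-point tolerance
kurtosis_unchanged <- isTRUE(all.equal(kurt_log, kurt_log_scale))
```

If this holds, any peakedness seen in the LogScale histogram would already be present in the log-transformed values (or come from the binning) rather than from the scaling step.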
However, one observation is very clear: scaling the data first centers them on 0, so many values end up less than or equal to 0. Log-transforming a negative value produces NaN (and an exact 0 would give -Inf), which is what the warnings above report. This happened in both examples, and it definitely leads me to think that the right order is log, then scale.
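To put a number on this, here is a minimal sketch that counts how many values are lost in each order, using simulated data like the example above. The seed and variable names are assumptions added for illustration, and the exact counts will depend on the seed.

```r
# Arbitrary seed, used only so the sketch is reproducible
set.seed(42)
x <- exp(rnorm(n = 500, mean = 0, sd = 1))

# Scale first: values below the mean become negative
z <- as.numeric(scale(x))
n_negative <- sum(z < 0)

# Then log: negative inputs give NaN (warning suppressed for the sketch)
scale_log <- suppressWarnings(log(z))
n_nan <- sum(is.nan(scale_log))

# Log first, then scale: the input is strictly positive, so nothing is lost
log_scale <- as.numeric(scale(log(x)))
n_lost_log_scale <- sum(!is.finite(log_scale))

c(negative_after_scaling = n_negative,
  nan_scale_then_log = n_nan,
  lost_log_then_scale = n_lost_log_scale)
```

The number of NaN values matches the number of negative scaled values, while the log-then-scale version keeps every observation.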