Introduction to Analysis of Variance - ANOVA

While learning about regression, we’ve been working with both continuous predictor and response variables. But in biology there are many times when we are working with categorical predictor variables and continuous responses. For example, let’s consider a fertilizer addition experiment, where the growth of some plant species is measured under a fertilizer versus no-fertilizer treatment. Here, the predictor variable (fertilizer addition or not) is a categorical variable.

Challenge

We have worked with this kind of simple case, where we have one-predictor variable which can take two values. What did we do in this case?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(iris)
t.test( x = filter(iris, Species == "setosa")$Petal.Length,
        y = filter(iris, Species == "versicolor")$Petal.Length )
## 
##  Welch Two Sample t-test
## 
## data:  filter(iris, Species == "setosa")$Petal.Length and filter(iris, Species == "versicolor")$Petal.Length
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.939618 -2.656382
## sample estimates:
## mean of x mean of y 
##     1.462     4.260
anova(lm(data = filter(iris, Species != "virginica"), formula = Petal.Length ~ Species))
## Analysis of Variance Table
## 
## Response: Petal.Length
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Species    1 195.720 195.720  1559.7 < 2.2e-16 ***
## Residuals 98  12.298   0.125                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we have more than two values within a single predictor variable, then we cannot use a t-test. Here we go to an ANOVA.

anova(lm(data = iris, formula = Petal.Length ~ Species))
## Analysis of Variance Table
## 
## Response: Petal.Length
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals 147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visual understanding of ANOVA

First, recall the visual understanding of linear regression:

Logan - Figure 8.3

Logan - Figure 8.3

Now, what would this look like if the x value (i.e., the predictor) was categorical?

Logan - Figure 10.2

Logan - Figure 10.2

Fixed versus Random Effects

Fixed factors are ones we have manipulated, or that we expect a priori to have an effect on the response. From Logan, p 254 - “conclusions about the effects of a fixed factor are restricted to the specific treatment levels investigated …”

Random factors are, as per Logan, p. 254, “randomly chosen from all the possible levels of populations and are used as random representatives of the populations”. E.g., density of plants could be random. Or position of camera traps on a hillside.

Null hypotheses

Fixed effects model:

\[ H_0: \mu_1 = \mu_2 = ... = \mu_i = \mu \]

Random effects model:

\[ H_0: \sigma^2_1 = \sigma^2_2 = ... = \sigma^2_i = \sigma^2 \]

One-way ANOVA

From Medley and Clements (1998), by way of Logan 2010 - investigation of the impact of zinc contamination on diversity of diatom species in the USA Rocky Mountains.

diatoms <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter10/Data/medley.csv")
## Look at the data
head(diatoms)
##   STREAM ZINC DIVERSITY
## 1  Eagle BACK      2.27
## 2  Eagle HIGH      1.25
## 3  Eagle HIGH      1.15
## 4  Eagle  MED      1.62
## 5   Blue BACK      1.70
## 6   Blue HIGH      0.63
summary(aov(data = diatoms, DIVERSITY ~ ZINC ))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## ZINC         3  2.567  0.8555   3.939 0.0176 *
## Residuals   30  6.516  0.2172                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Make the linear model
diatom_lm <- lm(data = diatoms, DIVERSITY ~ ZINC )

## Look at the ANOVA results
anova(diatom_lm)
## Analysis of Variance Table
## 
## Response: DIVERSITY
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## ZINC       3 2.5666 0.85554  3.9387 0.01756 *
## Residuals 30 6.5164 0.21721                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Look at the linear model results
summary(diatom_lm)
## 
## Call:
## lm(formula = DIVERSITY ~ ZINC, data = diatoms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03750 -0.22896  0.07986  0.33222  0.79750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.79750    0.16478  10.909 5.81e-12 ***
## ZINCHIGH    -0.51972    0.22647  -2.295   0.0289 *  
## ZINCLOW      0.23500    0.23303   1.008   0.3213    
## ZINCMED     -0.07972    0.22647  -0.352   0.7273    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4661 on 30 degrees of freedom
## Multiple R-squared:  0.2826, Adjusted R-squared:  0.2108 
## F-statistic: 3.939 on 3 and 30 DF,  p-value: 0.01756
## Look at the model diagnostics
plot(diatom_lm)

Post-hoc comparisons

Now we can say that the biodiversity measures taken at different zinc concentrations DO NOT come from a common distribution, but we don’t know which are different from each other.

We could compare each concentration to each of the others separately, but recall the problems with multiple comparions. Instead, we’ll use a post-hoc comparison method. Specifically, we’ll use the Tukey Honest Significant Differences (HSD), but there are several other methods you might use as well (see Logan p. 265).

TukeyHSD(aov(data = diatoms, DIVERSITY ~ ZINC ))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DIVERSITY ~ ZINC, data = diatoms)
## 
## $ZINC
##                  diff        lwr        upr     p adj
## HIGH-BACK -0.51972222 -1.1355064 0.09606192 0.1218677
## LOW-BACK   0.23500000 -0.3986367 0.86863665 0.7457444
## MED-BACK  -0.07972222 -0.6955064 0.53606192 0.9847376
## LOW-HIGH   0.75472222  0.1389381 1.37050636 0.0116543
## MED-HIGH   0.44000000 -0.1573984 1.03739837 0.2095597
## MED-LOW   -0.31472222 -0.9305064 0.30106192 0.5153456

Note, that for reasons that have never been clear to me, you can only use the TukeyHSD method when using the aov function. For an alternative approach, see the glht function in the multcomp package. This is described in Logan (e.g., p. 267).

Planned comparisons

Where as unplanned, post-hoc comparisons require a correction for multiple comparisons, planned comparisons do not. We will not cover planned comparisons in class, but they may show up in the homework or on the final. Example 10B in Logan provides an example of the use of planned comparisons.

Non-parametric ANOVA

As with many of the other methods we’ve discussed, there are also non-parameteric ANOVA approachs. The most common is the Kruskal-Wallace test, which uses rank based statistics.

Nested ANOVA

Sometimes your replicates aren’t quite independent.

See visual as per Logan, Fig 11.1.

For formula, see Logan p. 284 and for calculation of F-ratios see Table 11.1. Note that Mean-Squared Errors are the same if B is random or fixed, but F-ratio is differen! Why? Discuss.

Unbalanced designes

Often your observations will not be clean. This is unfortunate, as it makes the analyses more complicated. But they are still doable.

Nested ANOVA Example

From Logan Example 11.A - Density-dependent grazing effects of sea urchin on filamentous algea (Andrew and Underwood 1993).

Levels: 1. Sea urchin density 2. Patches of Reef 3. Quadrats within patches

Get the data and have a look.

algae_density <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter11/Data/andrew.csv")

head(algae_density)
##   TREAT PATCH QUAD ALGAE
## 1    0%     1    1    46
## 2    0%     1    2    44
## 3    0%     1    3    41
## 4    0%     1    4    29
## 5    0%     1    5    11
## 6    0%     2    1    65
summary(algae_density)
##   TREAT        PATCH            QUAD       ALGAE      
##  0%  :20   Min.   : 1.00   Min.   :1   Min.   : 0.00  
##  100%:20   1st Qu.: 4.75   1st Qu.:2   1st Qu.: 0.00  
##  33% :20   Median : 8.50   Median :3   Median : 5.00  
##  66% :20   Mean   : 8.50   Mean   :3   Mean   :20.26  
##            3rd Qu.:12.25   3rd Qu.:4   3rd Qu.:41.00  
##            Max.   :16.00   Max.   :5   Max.   :83.00

Patch and Quad need to be converted to factors!

algae_density$PATCH <- factor(algae_density$PATCH)
algae_density$QUAD <- factor(algae_density$QUAD)

summary(algae_density)
##   TREAT        PATCH    QUAD       ALGAE      
##  0%  :20   1      : 5   1:16   Min.   : 0.00  
##  100%:20   2      : 5   2:16   1st Qu.: 0.00  
##  33% :20   3      : 5   3:16   Median : 5.00  
##  66% :20   4      : 5   4:16   Mean   :20.26  
##            5      : 5   5:16   3rd Qu.:41.00  
##            6      : 5          Max.   :83.00  
##            (Other):50

Much better.

Let’s make some visuals.

library(ggplot2)

ggplot(data = algae_density) +
  geom_boxplot(aes(x = TREAT, y = ALGAE)) +
  theme_bw()

We should also consider this from the perspective to collapsing the quadrats into their mean values.

library(dplyr)

algae_density_agg <-
  algae_density %>%
  group_by(TREAT, PATCH) %>%
  summarize(ALGAE_QUAD_MEAN = mean(ALGAE))

ggplot(data = algae_density_agg) +
  geom_boxplot(aes(x = TREAT, y = ALGAE_QUAD_MEAN)) +
  theme_bw()

Construct an ANOVA model and look at the results.

algae_density_aov <- aov(data = algae_density, ALGAE ~ TREAT + Error(PATCH))

summary(algae_density_aov)
## 
## Error: PATCH
##           Df Sum Sq Mean Sq F value Pr(>F)  
## TREAT      3  14429    4810   2.717 0.0913 .
## Residuals 12  21242    1770                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 64  19110   298.6
  • What can we conclude?
  • What was the unit of replication?

Factorial ANOVA

What if we want to look at two, or more, factors where one is not nested in the other?

See visuals as per Logan, Fig 12.1 and 12.2.

For formula, see Logan p. 314 and for calculation of F-ratios see Table 12.1.

Example - Two factor fixed effects model

Effects of density and season on the number of eggs produced by slipper limpets.

limpets1 <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter12/Data/quinn1.csv")

head(limpets1)
##   DENSITY SEASON  EGGS
## 1       6 spring 1.167
## 2       6 spring 0.500
## 3       6 spring 1.667
## 4       6 summer 4.000
## 5       6 summer 3.830
## 6       6 summer 3.830
summary(limpets1)
##     DENSITY      SEASON       EGGS       
##  Min.   : 6   spring:9   Min.   :0.5000  
##  1st Qu.: 6   summer:9   1st Qu.:0.8748  
##  Median :12              Median :1.6485  
##  Mean   :14              Mean   :1.9484  
##  3rd Qu.:24              3rd Qu.:2.7075  
##  Max.   :24              Max.   :4.0000
limpets1$DENSITY <- factor(limpets1$DENSITY)
summary(limpets1)
##  DENSITY    SEASON       EGGS       
##  6 :6    spring:9   Min.   :0.5000  
##  12:6    summer:9   1st Qu.:0.8748  
##  24:6               Median :1.6485  
##                     Mean   :1.9484  
##                     3rd Qu.:2.7075  
##                     Max.   :4.0000
ggplot(data = limpets1) +
  geom_boxplot(aes(x = DENSITY, y = EGGS)) +
  theme_bw()

ggplot(data = limpets1) +
  geom_boxplot(aes(x = SEASON, y = EGGS)) +
  theme_bw()

ggplot(data = limpets1) +
  geom_boxplot(aes(x = DENSITY, y = EGGS, fill = SEASON)) +
  theme_bw()

And let’s run the model.

limpets1_lm <- lm(data = limpets1, formula = EGGS ~ SEASON * DENSITY)
summary(limpets1_lm)
## 
## Call:
## lm(formula = EGGS ~ SEASON * DENSITY, data = limpets1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61133 -0.16167 -0.04217  0.09892  0.55567 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.1113333  0.2183934   5.089 0.000267 ***
## SEASONsummer            2.7753333  0.3088549   8.986 1.12e-06 ***
## DENSITY12              -0.0003333  0.3088549  -0.001 0.999157    
## DENSITY24              -0.4166667  0.3088549  -1.349 0.202221    
## SEASONsummer:DENSITY12 -0.9996667  0.4367868  -2.289 0.041029 *  
## SEASONsummer:DENSITY24 -1.4700000  0.4367868  -3.365 0.005617 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3783 on 12 degrees of freedom
## Multiple R-squared:  0.9301, Adjusted R-squared:  0.9009 
## F-statistic: 31.93 on 5 and 12 DF,  p-value: 1.561e-06
anova(limpets1_lm)
## Analysis of Variance Table
## 
## Response: EGGS
##                Df  Sum Sq Mean Sq F value    Pr(>F)    
## SEASON          1 17.1483 17.1483 119.845 1.336e-07 ***
## DENSITY         2  4.0019  2.0010  13.984 0.0007325 ***
## SEASON:DENSITY  2  1.6907  0.8454   5.908 0.0163632 *  
## Residuals      12  1.7170  0.1431                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example - Two factor fixed effects model (Model I)

limpets <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter12/Data/quinn.csv")

head(limpets)
##   DENSITY SEASON  EGGS
## 1       8 spring 2.875
## 2       8 spring 2.625
## 3       8 spring 1.750
## 4       8 summer 2.125
## 5       8 summer 1.500
## 6       8 summer 1.875
summary(limpets)
##     DENSITY         SEASON        EGGS       
##  Min.   : 8.00   spring:12   Min.   :0.3560  
##  1st Qu.:13.25   summer:12   1st Qu.:0.9165  
##  Median :22.50               Median :1.4330  
##  Mean   :24.50               Mean   :1.4717  
##  3rd Qu.:33.75               3rd Qu.:1.9227  
##  Max.   :45.00               Max.   :2.8750
limpets$DENSITY <- factor(limpets$DENSITY)

Let’s get some visuals.

ggplot(data = limpets) +
  geom_boxplot(aes(x = DENSITY, y = EGGS)) +
  theme_bw()

ggplot(data = limpets) +
  geom_boxplot(aes(x = SEASON, y = EGGS)) +
  theme_bw()

ggplot(data = limpets) +
  geom_boxplot(aes(x = DENSITY, y = EGGS, fill = SEASON)) +
  theme_bw()

And let’s run the model.

limpets_lm <- lm(data = limpets, formula = EGGS ~ SEASON * DENSITY)
summary(limpets_lm)
## 
## Call:
## lm(formula = EGGS ~ SEASON * DENSITY, data = limpets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6667 -0.2612 -0.0610  0.2292  0.6647 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.41667    0.24642   9.807  3.6e-08 ***
## SEASONsummer           -0.58333    0.34849  -1.674  0.11358    
## DENSITY15              -0.23933    0.34849  -0.687  0.50206    
## DENSITY30              -0.85133    0.34849  -2.443  0.02655 *  
## DENSITY45              -1.21700    0.34849  -3.492  0.00301 ** 
## SEASONsummer:DENSITY15 -0.41633    0.49284  -0.845  0.41069    
## SEASONsummer:DENSITY30 -0.17067    0.49284  -0.346  0.73363    
## SEASONsummer:DENSITY45 -0.02367    0.49284  -0.048  0.96229    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4268 on 16 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.6392 
## F-statistic: 6.822 on 7 and 16 DF,  p-value: 0.000745
anova(limpets_lm)
## Analysis of Variance Table
## 
## Response: EGGS
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## SEASON          1 3.2502  3.2502 17.8419 0.0006453 ***
## DENSITY         3 5.2841  1.7614  9.6691 0.0007041 ***
## SEASON:DENSITY  3 0.1647  0.0549  0.3014 0.8239545    
## Residuals      16 2.9146  0.1822                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Problems from unbalanced design

See Logan Fig 12.3, p. 325. The variance in a model is “explained” sequentially.

Provided there’s time, we’ll play with the models above.