While learning about regression, we’ve been working with both continuous predictor and response variables. But in biology there are many times when we are working with categorical predictor variables and continuous responses. For example, let’s consider a fertilizer addition experiment, where the growth of some plant species is measured under a fertilizer versus no-fertilizer treatment. Here, the predictor variable (fertilizer addition or not) is a categorical variable.
We have worked with this kind of simple case, where we have one-predictor variable which can take two values. What did we do in this case?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(iris)
t.test( x = filter(iris, Species == "setosa")$Petal.Length,
y = filter(iris, Species == "versicolor")$Petal.Length )
##
## Welch Two Sample t-test
##
## data: filter(iris, Species == "setosa")$Petal.Length and filter(iris, Species == "versicolor")$Petal.Length
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.939618 -2.656382
## sample estimates:
## mean of x mean of y
## 1.462 4.260
anova(lm(data = filter(iris, Species != "virginica"), formula = Petal.Length ~ Species))
## Analysis of Variance Table
##
## Response: Petal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 1 195.720 195.720 1559.7 < 2.2e-16 ***
## Residuals 98 12.298 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we have more than two values within a single predictor variable, then we cannot use a t-test. Here we go to an ANOVA.
anova(lm(data = iris, formula = Petal.Length ~ Species))
## Analysis of Variance Table
##
## Response: Petal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
## Residuals 147 27.22 0.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First, recall the visual understanding of linear regression:
Now, what would this look like if the x value (i.e., the predictor) was categorical?
Fixed factors are ones we have manipulated, or that we expect a priori to have an effect on the response. From Logan, p 254 - “conclusions about the effects of a fixed factor are restricted to the specific treatment levels investigated …”
Random factors are, as per Logan, p. 254, “randomly chosen from all the possible levels of populations and are used as random representatives of the populations”. E.g., density of plants could be random. Or position of camera traps on a hillside.
Fixed effects model:
\[ H_0: \mu_1 = \mu_2 = ... = \mu_i = \mu \]
Random effects model:
\[ H_0: \sigma^2_1 = \sigma^2_2 = ... = \sigma^2_i = \sigma^2 \]
From Medley and Clements (1998), by way of Logan 2010 - investigation of the impact of zinc contamination on diversity of diatom species in the USA Rocky Mountains.
diatoms <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter10/Data/medley.csv")
## Look at the data
head(diatoms)
## STREAM ZINC DIVERSITY
## 1 Eagle BACK 2.27
## 2 Eagle HIGH 1.25
## 3 Eagle HIGH 1.15
## 4 Eagle MED 1.62
## 5 Blue BACK 1.70
## 6 Blue HIGH 0.63
summary(aov(data = diatoms, DIVERSITY ~ ZINC ))
## Df Sum Sq Mean Sq F value Pr(>F)
## ZINC 3 2.567 0.8555 3.939 0.0176 *
## Residuals 30 6.516 0.2172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Make the linear model
diatom_lm <- lm(data = diatoms, DIVERSITY ~ ZINC )
## Look at the ANOVA results
anova(diatom_lm)
## Analysis of Variance Table
##
## Response: DIVERSITY
## Df Sum Sq Mean Sq F value Pr(>F)
## ZINC 3 2.5666 0.85554 3.9387 0.01756 *
## Residuals 30 6.5164 0.21721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Look at the linear model results
summary(diatom_lm)
##
## Call:
## lm(formula = DIVERSITY ~ ZINC, data = diatoms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03750 -0.22896 0.07986 0.33222 0.79750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.79750 0.16478 10.909 5.81e-12 ***
## ZINCHIGH -0.51972 0.22647 -2.295 0.0289 *
## ZINCLOW 0.23500 0.23303 1.008 0.3213
## ZINCMED -0.07972 0.22647 -0.352 0.7273
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4661 on 30 degrees of freedom
## Multiple R-squared: 0.2826, Adjusted R-squared: 0.2108
## F-statistic: 3.939 on 3 and 30 DF, p-value: 0.01756
## Look at the model diagnostics
plot(diatom_lm)
Now we can say that the biodiversity measures taken at different zinc concentrations DO NOT come from a common distribution, but we don’t know which are different from each other.
We could compare each concentration to each of the others separately, but recall the problems with multiple comparions. Instead, we’ll use a post-hoc comparison method. Specifically, we’ll use the Tukey Honest Significant Differences (HSD), but there are several other methods you might use as well (see Logan p. 265).
TukeyHSD(aov(data = diatoms, DIVERSITY ~ ZINC ))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DIVERSITY ~ ZINC, data = diatoms)
##
## $ZINC
## diff lwr upr p adj
## HIGH-BACK -0.51972222 -1.1355064 0.09606192 0.1218677
## LOW-BACK 0.23500000 -0.3986367 0.86863665 0.7457444
## MED-BACK -0.07972222 -0.6955064 0.53606192 0.9847376
## LOW-HIGH 0.75472222 0.1389381 1.37050636 0.0116543
## MED-HIGH 0.44000000 -0.1573984 1.03739837 0.2095597
## MED-LOW -0.31472222 -0.9305064 0.30106192 0.5153456
Note, that for reasons that have never been clear to me, you can only use the TukeyHSD method when using the aov
function. For an alternative approach, see the glht
function in the multcomp
package. This is described in Logan (e.g., p. 267).
Where as unplanned, post-hoc comparisons require a correction for multiple comparisons, planned comparisons do not. We will not cover planned comparisons in class, but they may show up in the homework or on the final. Example 10B in Logan provides an example of the use of planned comparisons.
As with many of the other methods we’ve discussed, there are also non-parameteric ANOVA approachs. The most common is the Kruskal-Wallace test, which uses rank based statistics.
Sometimes your replicates aren’t quite independent.
See visual as per Logan, Fig 11.1.
For formula, see Logan p. 284 and for calculation of F-ratios see Table 11.1. Note that Mean-Squared Errors are the same if B is random or fixed, but F-ratio is differen! Why? Discuss.
Often your observations will not be clean. This is unfortunate, as it makes the analyses more complicated. But they are still doable.
From Logan Example 11.A - Density-dependent grazing effects of sea urchin on filamentous algea (Andrew and Underwood 1993).
Levels: 1. Sea urchin density 2. Patches of Reef 3. Quadrats within patches
Get the data and have a look.
algae_density <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter11/Data/andrew.csv")
head(algae_density)
## TREAT PATCH QUAD ALGAE
## 1 0% 1 1 46
## 2 0% 1 2 44
## 3 0% 1 3 41
## 4 0% 1 4 29
## 5 0% 1 5 11
## 6 0% 2 1 65
summary(algae_density)
## TREAT PATCH QUAD ALGAE
## 0% :20 Min. : 1.00 Min. :1 Min. : 0.00
## 100%:20 1st Qu.: 4.75 1st Qu.:2 1st Qu.: 0.00
## 33% :20 Median : 8.50 Median :3 Median : 5.00
## 66% :20 Mean : 8.50 Mean :3 Mean :20.26
## 3rd Qu.:12.25 3rd Qu.:4 3rd Qu.:41.00
## Max. :16.00 Max. :5 Max. :83.00
Patch and Quad need to be converted to factors!
algae_density$PATCH <- factor(algae_density$PATCH)
algae_density$QUAD <- factor(algae_density$QUAD)
summary(algae_density)
## TREAT PATCH QUAD ALGAE
## 0% :20 1 : 5 1:16 Min. : 0.00
## 100%:20 2 : 5 2:16 1st Qu.: 0.00
## 33% :20 3 : 5 3:16 Median : 5.00
## 66% :20 4 : 5 4:16 Mean :20.26
## 5 : 5 5:16 3rd Qu.:41.00
## 6 : 5 Max. :83.00
## (Other):50
Much better.
Let’s make some visuals.
library(ggplot2)
ggplot(data = algae_density) +
geom_boxplot(aes(x = TREAT, y = ALGAE)) +
theme_bw()
We should also consider this from the perspective to collapsing the quadrats into their mean values.
library(dplyr)
algae_density_agg <-
algae_density %>%
group_by(TREAT, PATCH) %>%
summarize(ALGAE_QUAD_MEAN = mean(ALGAE))
ggplot(data = algae_density_agg) +
geom_boxplot(aes(x = TREAT, y = ALGAE_QUAD_MEAN)) +
theme_bw()
Construct an ANOVA model and look at the results.
algae_density_aov <- aov(data = algae_density, ALGAE ~ TREAT + Error(PATCH))
summary(algae_density_aov)
##
## Error: PATCH
## Df Sum Sq Mean Sq F value Pr(>F)
## TREAT 3 14429 4810 2.717 0.0913 .
## Residuals 12 21242 1770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 64 19110 298.6
What if we want to look at two, or more, factors where one is not nested in the other?
See visuals as per Logan, Fig 12.1 and 12.2.
For formula, see Logan p. 314 and for calculation of F-ratios see Table 12.1.
Effects of density and season on the number of eggs produced by slipper limpets.
limpets1 <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter12/Data/quinn1.csv")
head(limpets1)
## DENSITY SEASON EGGS
## 1 6 spring 1.167
## 2 6 spring 0.500
## 3 6 spring 1.667
## 4 6 summer 4.000
## 5 6 summer 3.830
## 6 6 summer 3.830
summary(limpets1)
## DENSITY SEASON EGGS
## Min. : 6 spring:9 Min. :0.5000
## 1st Qu.: 6 summer:9 1st Qu.:0.8748
## Median :12 Median :1.6485
## Mean :14 Mean :1.9484
## 3rd Qu.:24 3rd Qu.:2.7075
## Max. :24 Max. :4.0000
limpets1$DENSITY <- factor(limpets1$DENSITY)
summary(limpets1)
## DENSITY SEASON EGGS
## 6 :6 spring:9 Min. :0.5000
## 12:6 summer:9 1st Qu.:0.8748
## 24:6 Median :1.6485
## Mean :1.9484
## 3rd Qu.:2.7075
## Max. :4.0000
ggplot(data = limpets1) +
geom_boxplot(aes(x = DENSITY, y = EGGS)) +
theme_bw()
ggplot(data = limpets1) +
geom_boxplot(aes(x = SEASON, y = EGGS)) +
theme_bw()
ggplot(data = limpets1) +
geom_boxplot(aes(x = DENSITY, y = EGGS, fill = SEASON)) +
theme_bw()
And let’s run the model.
limpets1_lm <- lm(data = limpets1, formula = EGGS ~ SEASON * DENSITY)
summary(limpets1_lm)
##
## Call:
## lm(formula = EGGS ~ SEASON * DENSITY, data = limpets1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61133 -0.16167 -0.04217 0.09892 0.55567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1113333 0.2183934 5.089 0.000267 ***
## SEASONsummer 2.7753333 0.3088549 8.986 1.12e-06 ***
## DENSITY12 -0.0003333 0.3088549 -0.001 0.999157
## DENSITY24 -0.4166667 0.3088549 -1.349 0.202221
## SEASONsummer:DENSITY12 -0.9996667 0.4367868 -2.289 0.041029 *
## SEASONsummer:DENSITY24 -1.4700000 0.4367868 -3.365 0.005617 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3783 on 12 degrees of freedom
## Multiple R-squared: 0.9301, Adjusted R-squared: 0.9009
## F-statistic: 31.93 on 5 and 12 DF, p-value: 1.561e-06
anova(limpets1_lm)
## Analysis of Variance Table
##
## Response: EGGS
## Df Sum Sq Mean Sq F value Pr(>F)
## SEASON 1 17.1483 17.1483 119.845 1.336e-07 ***
## DENSITY 2 4.0019 2.0010 13.984 0.0007325 ***
## SEASON:DENSITY 2 1.6907 0.8454 5.908 0.0163632 *
## Residuals 12 1.7170 0.1431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
limpets <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter12/Data/quinn.csv")
head(limpets)
## DENSITY SEASON EGGS
## 1 8 spring 2.875
## 2 8 spring 2.625
## 3 8 spring 1.750
## 4 8 summer 2.125
## 5 8 summer 1.500
## 6 8 summer 1.875
summary(limpets)
## DENSITY SEASON EGGS
## Min. : 8.00 spring:12 Min. :0.3560
## 1st Qu.:13.25 summer:12 1st Qu.:0.9165
## Median :22.50 Median :1.4330
## Mean :24.50 Mean :1.4717
## 3rd Qu.:33.75 3rd Qu.:1.9227
## Max. :45.00 Max. :2.8750
limpets$DENSITY <- factor(limpets$DENSITY)
Let’s get some visuals.
ggplot(data = limpets) +
geom_boxplot(aes(x = DENSITY, y = EGGS)) +
theme_bw()
ggplot(data = limpets) +
geom_boxplot(aes(x = SEASON, y = EGGS)) +
theme_bw()
ggplot(data = limpets) +
geom_boxplot(aes(x = DENSITY, y = EGGS, fill = SEASON)) +
theme_bw()
And let’s run the model.
limpets_lm <- lm(data = limpets, formula = EGGS ~ SEASON * DENSITY)
summary(limpets_lm)
##
## Call:
## lm(formula = EGGS ~ SEASON * DENSITY, data = limpets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6667 -0.2612 -0.0610 0.2292 0.6647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.41667 0.24642 9.807 3.6e-08 ***
## SEASONsummer -0.58333 0.34849 -1.674 0.11358
## DENSITY15 -0.23933 0.34849 -0.687 0.50206
## DENSITY30 -0.85133 0.34849 -2.443 0.02655 *
## DENSITY45 -1.21700 0.34849 -3.492 0.00301 **
## SEASONsummer:DENSITY15 -0.41633 0.49284 -0.845 0.41069
## SEASONsummer:DENSITY30 -0.17067 0.49284 -0.346 0.73363
## SEASONsummer:DENSITY45 -0.02367 0.49284 -0.048 0.96229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4268 on 16 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.6392
## F-statistic: 6.822 on 7 and 16 DF, p-value: 0.000745
anova(limpets_lm)
## Analysis of Variance Table
##
## Response: EGGS
## Df Sum Sq Mean Sq F value Pr(>F)
## SEASON 1 3.2502 3.2502 17.8419 0.0006453 ***
## DENSITY 3 5.2841 1.7614 9.6691 0.0007041 ***
## SEASON:DENSITY 3 0.1647 0.0549 0.3014 0.8239545
## Residuals 16 2.9146 0.1822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
See Logan Fig 12.3, p. 325. The variance in a model is “explained” sequentially.
Provided there’s time, we’ll play with the models above.