Recall the linear regression model we’ve been working with quite a bit this semester:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
where
\[ \epsilon_i \sim N(0, \sigma^2) \]
Two of the major assumptions in this model are that:

- the relationship between \(X\) and \(Y\) is linear, and
- the error terms are normally distributed.
We refer to the class of models that rely on the above structure as General Linear Models. As we’ve learned throughout this semester, they can be very good when dealing with continuous response variables that are normally distributed. But what do we do when this is not the case?
Generalized Linear Models are extensions of the models we’ve been using that free us from the constraints of these two assumptions.
The basic structure of a Generalized Linear Model (GLM) is:
\[ g(Y_i) = \beta_0 + \beta_1 X_i + \epsilon_i \]
where
\[ \epsilon_i \sim \text{variance function} \]
In the above, we refer to \(g(\bullet)\) as the link function. Quite literally, the link function “links” the linear model with the response (dependent) variable.
Here are some of the classic examples of when we would use a GLM instead of traditional regression:

- binary (e.g., presence/absence) response variables
- count data
- proportion data
Let’s start by considering the case of Bernoulli or Binomially distributed response variables. For example:

- presence/absence of a species on an island
- survival (alive/dead) of individuals
- the number of “successes” out of \(n_i\) trials
In all of these cases, we can consider \(Y_i \sim \text{Binom}(n_i, p_i)\).
The link function for Binomial data is:
\[ g(Y_i) = log( \frac{p_i}{1-p_i} ) = log(\text{"odds"}) \]
This is also known as the logit.
And the variance function is:
\[ V(Y_i) \propto p_i(1-p_i) \]
Looking at our basic structure formula, and applying our link function, we can write our model as:
\[ log( \frac{p_i}{1-p_i} ) = \beta_0 + \beta_1 X_i + \epsilon_i \]
And if we were to re-arrange this so \(p_i\) was by itself, we would get:
\[ p_i = \frac{e^{ \beta_0 + \beta_1 X_i}}{ 1 + e^{ \beta_0 + \beta_1 X_i}} \]
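As a quick check of this back-transformation, here is a minimal sketch (the coefficient values are made up for illustration) showing that R’s built-in plogis function computes the same inverse-logit:

beta0 <- 1.5   # made-up intercept, for illustration only
beta1 <- -0.2  # made-up slope
x <- 10
eta <- beta0 + beta1 * x     # the linear predictor
exp(eta) / (1 + exp(eta))    # p_i via the formula above
plogis(eta)                  # R's built-in inverse-logit; same value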
How do we estimate \(\beta_0\), \(\beta_1\), and our variance component? Because our errors are not normally distributed, we cannot use OLS, so instead we have to use Maximum Likelihood: we maximize the likelihood of the Binomial distribution:
\[ L(p_i|Y_i) = \prod_{i=1}^{N} \frac{n_i!}{Y_i!(n_i - Y_i)!} p_i^{Y_i} (1-p_i)^{(n_i-Y_i)} \]
where we substitute \(p_i\) with the previous equation for \(p_i\) in terms of our \(\beta\)’s.
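To see what that maximization looks like in practice, here is a minimal sketch using simulated Bernoulli data (i.e., \(n_i = 1\)) and optim. This is only an illustration of the idea; glm (used below) actually fits the model via iteratively reweighted least squares.

# A sketch of maximum likelihood estimation for logistic regression,
# using simulated data (all values here are made up for illustration)
set.seed(42)
x <- runif(50, 0, 10)
y <- rbinom(50, size = 1, prob = plogis(-2 + 0.8 * x))

# Negative log-likelihood of the Bernoulli (n_i = 1) model,
# with p_i written in terms of the betas as above
neg_log_lik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

optim(c(0, 0), neg_log_lik)$par   # compare to coef(glm(y ~ x, family = binomial))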
If we fit a logistic model without an intercept, the best fit line must predict p = 0.5 when x = 0. So the intercept in this case shifts the probability for x = 0 up or down relative to 0.5.
The “slope” \(\hat{\beta_1}\) is the change in the log odds for a 1 unit change in X. Equivalently, \(e^{\hat{\beta_1}}\) is the multiplicative change in the odds for a 1 unit change in X.
From Logan p. 498
As part of an investigation into the factors controlling island spider populations, Polis et al. (1998) recorded the physical and biotic characteristics of the islands in the Gulf of California. Quinn and Keough (2002) subsequently modelled the presence/absence (PA) of a potential spider predator (Uta lizards) against the perimeter to area ratio (RATIO) of the islands to illustrate logistic regression (from Box 13.1 of Quinn and Keough (2002)).
Get the data
spiders <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter17/Data/polis.csv")
spiders
## ISLAND RATIO PA
## 1 Bota 15.41 1
## 2 Cabeza 5.63 1
## 3 Cerraja 25.92 1
## 4 Coronadito 15.17 0
## 5 Flecha 13.04 1
## 6 Gemelose 18.85 0
## 7 Gemelosw 30.95 0
## 8 Jorabado 22.87 0
## 9 Mitlan 12.01 0
## 10 Pata 11.60 1
## 11 Piojo 6.09 1
## 12 Smith 2.28 1
## 13 Ventana 4.05 1
## 14 Bahiaan 59.94 0
## 15 Bahiaas 63.16 0
## 16 Blanca 22.76 0
## 17 Pescador 23.54 0
## 18 Angeldlg 0.21 1
## 19 Mejia 2.55 1
Let’s plot these data
library(ggplot2)
ggplot(data = spiders, aes(x = RATIO, y = PA)) +
geom_point() +
theme_bw()
Let’s fit a general linear model, and see what happens.
spiders.lm <- lm(data = spiders, formula = PA ~ RATIO)
summary(spiders.lm)
##
## Call:
## lm(formula = PA ~ RATIO, data = spiders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6493 -0.4447 0.1778 0.2641 0.6049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.868813 0.140880 6.167 1.03e-05 ***
## RATIO -0.018278 0.005565 -3.284 0.00438 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4129 on 17 degrees of freedom
## Multiple R-squared: 0.3882, Adjusted R-squared: 0.3522
## F-statistic: 10.79 on 1 and 17 DF, p-value: 0.004376
plot(spiders.lm)
ggplot(data = spiders, aes(x = RATIO, y = PA)) +
geom_point() +
theme_bw() +
stat_smooth(method = "lm")
Let’s try a generalized linear model (GLM). In R, we use the glm function, whose family argument specifies the error distribution and its associated link function. Have a look at the link options listed in ?glm.
spiders.glm <- glm(data = spiders, formula = PA ~ RATIO, family = binomial)
plot(spiders.glm)
Because the response variable is binary, we should expect some of the patterns we see in the Residuals vs Fitted plot.
Let’s look at the summary.
summary(spiders.glm)
##
## Call:
## glm(formula = PA ~ RATIO, family = binomial, data = spiders)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6067 -0.6382 0.2368 0.4332 2.0986
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.6061 1.6953 2.127 0.0334 *
## RATIO -0.2196 0.1005 -2.184 0.0289 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.287 on 18 degrees of freedom
## Residual deviance: 14.221 on 17 degrees of freedom
## AIC: 18.221
##
## Number of Fisher Scoring iterations: 6
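Applying the coefficient interpretation from earlier, we can put these estimates back on the odds and probability scales:

exp(coef(spiders.glm))        # e^beta: multiplicative change in the odds per unit RATIO
plogis(coef(spiders.glm)[1])  # predicted probability of presence when RATIO = 0

Here each 1 unit increase in RATIO multiplies the odds of lizard presence by roughly \(e^{-0.22} \approx 0.80\).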
Let’s plot this model.
ggplot(data = spiders, aes(x = RATIO, y = PA)) +
geom_point() +
theme_bw() +
stat_smooth(method = "glm", method.args = list(family = "binomial"))
We can test whether the GLM is significantly better than the intercept-only model by using the anova function.
anova(spiders.glm, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: PA
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 18 26.287
## RATIO 1 12.066 17 14.221 0.0005134 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that for GLMs, summary does not return an \(R^2\) value. We can still calculate an analogue of \(R^2\) by looking at the deviance explained by our model versus the null deviance.
1 - (spiders.glm$deviance/spiders.glm$null.deviance)
## [1] 0.4590197
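We can also use the fitted model to generate predicted probabilities for new islands, letting predict apply the inverse link for us, and find the perimeter to area ratio at which presence and absence are equally likely (where the logit is 0, i.e., \(-\hat{\beta_0}/\hat{\beta_1}\)). The RATIO values below are made up for illustration.

# Predicted probabilities of lizard presence at some example RATIO values
predict(spiders.glm, newdata = data.frame(RATIO = c(5, 20, 50)),
        type = "response")

# RATIO at which the predicted probability of presence is 0.5
-coef(spiders.glm)[1] / coef(spiders.glm)[2]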
Sometimes we have a mix of continuous and categorical predictor variables with a continuous response variable.
See Figure 15.1 in Logan p. 449
From Logan p. 457
To investigate the impacts of sexual activity on male fruitfly longevity, Partridge and Farquhar (1981) measured the longevity of male fruitflies with access to either one virgin female (potential mate), eight virgin females, one pregnant female (not a potential mate), eight pregnant females, or no females. The available male fruitflies varied in size, and since size is known to impact longevity, the researchers randomly allocated each individual fruitfly to one of the five treatments and also measured thorax length as a covariate (from Box 12.1 of Quinn and Keough (2002)).
Get the data
fruitfly <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter15/Data/partridge.csv")
head(fruitfly)
## TREATMENT THORAX LONGEV
## 1 Preg8 0.64 35
## 2 Preg8 0.68 37
## 3 Preg8 0.68 49
## 4 Preg8 0.72 46
## 5 Preg8 0.72 63
## 6 Preg8 0.76 39
Plot these data
By group membership
ggplot(data = fruitfly, aes(x = TREATMENT, y = LONGEV)) +
geom_boxplot() +
theme_bw()
By size
ggplot(data = fruitfly, aes(x = THORAX, y = LONGEV)) +
geom_point() +
theme_bw()
By both
ggplot(data = fruitfly, aes(x = THORAX, y = LONGEV, colour = TREATMENT)) +
geom_point() +
theme_bw()
With linear regressions
ggplot(data = fruitfly, aes(x = THORAX, y = LONGEV)) +
geom_point(aes(colour = TREATMENT)) +
stat_smooth(aes(colour = TREATMENT), method = "lm") +
stat_smooth(method = "lm", size = 2, colour = "black") +
theme_bw()
Let’s build an ANCOVA model
fruitfly_ancova <- lm(data = fruitfly, LONGEV ~ TREATMENT * THORAX)
Looking at the ANOVA table for this model, the fact that the TREATMENT:THORAX interaction is not significant indicates that the slopes are not significantly different from each other.
anova(fruitfly_ancova)
## Analysis of Variance Table
##
## Response: LONGEV
## Df Sum Sq Mean Sq F value Pr(>F)
## TREATMENT 4 11939.3 2984.8 26.1983 1.896e-15 ***
## THORAX 1 13168.9 13168.9 115.5855 < 2.2e-16 ***
## TREATMENT:THORAX 4 42.5 10.6 0.0933 0.9844
## Residuals 115 13102.1 113.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Look at the model diagnostics
plot(fruitfly_ancova)
Note that there is a wedge-shaped pattern in the residuals, suggestive of non-homogeneity of variance. We can use a log10 transform to deal with this problem (and, since the interaction was not significant, drop it from the model).
fruitfly_ancova_2 <- lm(data = fruitfly, log10(LONGEV) ~ TREATMENT + THORAX)
plot(fruitfly_ancova_2)
summary(fruitfly_ancova_2)
##
## Call:
## lm(formula = log10(LONGEV) ~ TREATMENT + THORAX, data = fruitfly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.226736 -0.058445 -0.003469 0.059961 0.170389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79095 0.08443 9.368 5.89e-16 ***
## TREATMENTPreg1 0.02260 0.02368 0.954 0.3419
## TREATMENTPreg8 0.03648 0.02385 1.530 0.1287
## TREATMENTVirg1 -0.05381 0.02366 -2.275 0.0247 *
## TREATMENTVirg8 -0.18165 0.02392 -7.592 7.79e-12 ***
## THORAX 1.19385 0.09900 12.060 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08364 on 119 degrees of freedom
## Multiple R-squared: 0.7055, Adjusted R-squared: 0.6932
## F-statistic: 57.02 on 5 and 119 DF, p-value: < 2.2e-16
Look at how the ANCOVA results differ from models that ignore either the continuous or the categorical predictor.
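As a sketch of that comparison, we could fit reduced models that each drop one predictor and compare the variance explained (the model names below are my own):

# Reduced models: categorical-only and continuous-only
fruitfly_cat_only  <- lm(data = fruitfly, log10(LONGEV) ~ TREATMENT)
fruitfly_cont_only <- lm(data = fruitfly, log10(LONGEV) ~ THORAX)

# Compare adjusted R-squared values with the ANCOVA model
summary(fruitfly_cat_only)$adj.r.squared
summary(fruitfly_cont_only)$adj.r.squared
summary(fruitfly_ancova_2)$adj.r.squared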
Moving on to the analysis of frequency data: recall the \(\chi^2\) statistic,
\[ \chi^2 = \sum \frac{(o-e)^2}{e} \]
where \(o =\) observed and \(e =\) expected frequencies.
To examine whether a set of data is distributed in a specific way, or follows frequencies based on a known distribution, we can use the Kolmogorov-Smirnov (KS) Test. You can also use the KS Test to see whether two sets of data are distributed in the same way.
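For example, a minimal sketch with simulated data (all values made up):

# One-sample KS test: are the data consistent with a Normal(10, 2) distribution?
set.seed(123)
vals_a <- rnorm(40, mean = 10, sd = 2)
ks.test(vals_a, "pnorm", mean = 10, sd = 2)

# Two-sample KS test: are the two samples distributed in the same way?
vals_b <- rnorm(40, mean = 12, sd = 2)
ks.test(vals_a, vals_b)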
Contingency tables examine whether two or more categorical variables are associated with each other. This is essentially an extension of the \(\chi^2\) analysis. The null hypothesis is that the variables are not associated.
The G-test is an alternative to the goodness-of-fit test and contingency table approaches.
From Logan p. 472
The G-test is based on a log likelihood-ratio test. A log likelihood ratio is the ratio of the maximum likelihoods of the alternative and null hypotheses. More simply, a log likelihood-ratio test examines how likely the alternative hypothesis (representing an effect) is, compared to how likely the null hypothesis (no effect) is, given the collected data.
The \(G^2\) statistic is
\[ G^2 = 2 \sum_i o_i log(\frac{o_i}{e_i}) \]
where \(o =\) observed and \(e =\) expected frequencies.
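As a quick sketch of this formula with made-up counts (and equal expected frequencies under the null):

# G^2 by hand: made-up observed counts across three categories
obs <- c(30, 20, 10)
expected <- rep(sum(obs) / 3, 3)          # equal frequencies under the null
(G2 <- 2 * sum(obs * log(obs / expected)))
pchisq(G2, df = 2, lower.tail = FALSE)    # compare G^2 to a chi-squared with 2 df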
From Logan pp. 478 - 480
In order to investigate the mortality of coolibah (Eucalyptus coolibah) trees across riparian dunes, Roberts (1993) counted the number of quadrats in which dead trees were present and the number in which they were absent in three positions (top, middle and bottom) along transects from the lakeshore up to the top of dunes. In this case, the classification of quadrats according to the presence/absence of dead coolibah trees will be interpreted as a response variable and the position along transect as a predictor variable (see Box 14.3 of Quinn and Keough (2002)).
Get the data
trees <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter16/Data/roberts.csv")
head(trees)
## QUADRAT POSITION DEAD
## 1 1 Bottom With
## 2 2 Bottom With
## 3 3 Bottom Without
## 4 4 Bottom Without
## 5 5 Top Without
## 6 6 Bottom With
Make a contingency table
trees_xtab <- table(trees$POSITION, trees$DEAD)
Run a \(\chi^2\) test
(trees_chisq <- chisq.test(trees_xtab, corr = F))
## Warning in chisq.test(trees_xtab, corr = F): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: trees_xtab
## X-squared = 13.661, df = 2, p-value = 0.00108
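The warning above arises because the \(\chi^2\) approximation is questionable when expected cell counts are small; we can inspect them directly:

# Expected frequencies under the null hypothesis of no association
trees_chisq$expected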
We can also run a G-test using likelihood.test in the package Deducer (but note that this package has a lot of dependencies, as it is part of the JGR project).
library(Deducer)
likelihood.test(trees_xtab, conservative = TRUE)
From the results of both of these tests, we can reject the null hypothesis of no association between transect position and number of dead trees observed.