Generalized Linear Models (GLMs)

Recall the linear regression model we’ve been working with quite a bit this semester:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

where

\[ \epsilon_i \sim N(0, \sigma^2) \]

Two of the major assumptions in this model are that:

  1. The relationship between the independent and dependent variables is linear.
  2. The errors, \(\epsilon_i\), are normally distributed.

We refer to the class of models that rely on the above structure as General Linear Models. As we’ve learned throughout this semester, they can be very good when dealing with continuous response variables that are normally distributed. But what do we do when this is not the case?

Generalized Linear Models are extensions of the models we’ve been using that free us from the constraints of these two assumptions.

The basic structure of a Generalized Linear Model (GLM) is:

\[ g(Y_i) = \beta_0 + \beta_1 X_i + \epsilon_i \]

where

\[ \epsilon_i \sim \text{variance function} \]

In the above, we refer to \(g(\bullet)\) as the link function. Quite literally, the link function “links” the linear model with the response (dependent) variable.

Here are some classic examples of situations where we would use a GLM instead of traditional regression:

Logistic regression

Let’s start by considering the case of Bernoulli- or Binomially-distributed response variables. For example:

  • survival vs. mortality for individual organisms
  • allele of type I or type II for an individual
  • presence or absence of a species

In all of these cases, we can consider \(Y_i \sim \text{Binom}(n_i, p_i)\) (with \(n_i = 1\), this reduces to a Bernoulli trial).

The link function for Binomial data is:

\[ g(Y_i) = log( \frac{p_i}{1-p_i} ) = log(\text{"odds"}) \]

This is also known as the logit.

And the variance function is:

\[ V(Y_i) \propto p_i(1-p_i) \]

Looking at our basic structure formula, and applying our link function, we can write our model as:

\[ log( \frac{p_i}{1-p_i} ) = \beta_0 + \beta_1 X_i + \epsilon_i \]

And if we were to re-arrange this so \(p_i\) was by itself, we would get:

\[ p_i = \frac{e^{ \beta_0 + \beta_1 X_i}}{ 1 + e^{ \beta_0 + \beta_1 X_i}} \]
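
In R, the logit and its inverse are built in as qlogis() and plogis() (the quantile and distribution functions of the standard logistic distribution). A quick check of the rearrangement above, using made-up coefficient values for illustration:

# hypothetical coefficients, for illustration only
b0 <- -1.5
b1 <- 0.8
x  <- 2

eta <- b0 + b1 * x          # linear predictor (the log odds)
exp(eta) / (1 + exp(eta))   # p_i, by the formula above
plogis(eta)                 # same value, via the inverse logit
qlogis(plogis(eta))         # recovers the log odds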

How do we estimate \(\beta_0\), \(\beta_1\), and our variance component? Because our errors are not normally distributed, we cannot use OLS; instead we use Maximum Likelihood. We maximize the likelihood of the Binomial distribution:

\[ L(p_i|Y_i) = \prod_{i=1}^{N} \frac{n_i!}{Y_i!(n_i - Y_i)!} p_i^{Y_i} (1-p_i)^{(n_i-Y_i)} \]

where we substitute \(p_i\) with the previous equation for \(p_i\) in terms of our \(\beta\)’s.
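
As a sketch of what “maximize the likelihood” means in practice (glm() does this internally via iteratively reweighted least squares), we can minimize the negative Bernoulli log-likelihood directly with optim(). The data here are simulated purely for illustration:

set.seed(42)

# simulate Bernoulli data from known coefficients (illustration only)
x <- runif(100, 0, 10)
y <- rbinom(100, size = 1, prob = plogis(-2 + 0.5 * x))

# negative log-likelihood as a function of (beta0, beta1)
negLL <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

optim(c(0, 0), negLL)$par           # ML estimates
coef(glm(y ~ x, family = binomial)) # glm() agrees closely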

Interpreting the coefficients in logistic regression

If we fit a logistic model without an intercept, the best-fit line must predict p = 0.5 when x = 0, since the log odds are 0 there. The intercept therefore shifts the predicted probability at x = 0 up or down relative to 0.5.

The “slope” \(\hat{\beta_1}\) is the change in the log odds for a 1 unit change in X; equivalently, \(e^{\hat{\beta_1}}\) is the multiplicative change in the odds (the odds ratio) for a 1 unit change in X.

Example of Logistic Regression

From Logan p. 498

As part of an investigation into the factors controlling island spider populations, Polis et al. (1998) recorded the physical and biotic characteristics of the islands in the Gulf of California. Quinn and Keough (2002) subsequently modelled the presence/absence (PA) of a potential spider predator (Uta lizards) against the perimeter to area ratio (RATIO) of the islands to illustrate logistic regression (from Box 13.1 of Quinn and Keough (2002)).

Get the data

spiders <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter17/Data/polis.csv")

spiders
##        ISLAND RATIO PA
## 1        Bota 15.41  1
## 2      Cabeza  5.63  1
## 3     Cerraja 25.92  1
## 4  Coronadito 15.17  0
## 5      Flecha 13.04  1
## 6    Gemelose 18.85  0
## 7    Gemelosw 30.95  0
## 8    Jorabado 22.87  0
## 9      Mitlan 12.01  0
## 10       Pata 11.60  1
## 11      Piojo  6.09  1
## 12      Smith  2.28  1
## 13    Ventana  4.05  1
## 14    Bahiaan 59.94  0
## 15    Bahiaas 63.16  0
## 16     Blanca 22.76  0
## 17   Pescador 23.54  0
## 18   Angeldlg  0.21  1
## 19      Mejia  2.55  1

Let’s plot these data

library(ggplot2)

ggplot(data = spiders, aes(x = RATIO, y = PA)) +
  geom_point() +
  theme_bw()

Let’s fit a general linear model, and see what happens.

spiders.lm <- lm(data = spiders, formula = PA ~ RATIO)

summary(spiders.lm)
## 
## Call:
## lm(formula = PA ~ RATIO, data = spiders)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6493 -0.4447  0.1778  0.2641  0.6049 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.868813   0.140880   6.167 1.03e-05 ***
## RATIO       -0.018278   0.005565  -3.284  0.00438 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4129 on 17 degrees of freedom
## Multiple R-squared:  0.3882, Adjusted R-squared:  0.3522 
## F-statistic: 10.79 on 1 and 17 DF,  p-value: 0.004376
plot(spiders.lm)

ggplot(data = spiders, aes(x = RATIO, y = PA)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm")

Let’s try a generalized linear model (GLM)

  • Look at help for glm
  • Follow help for family - look at link options
?glm

spiders.glm <- glm(data = spiders, formula = PA ~ RATIO, family = binomial)

plot(spiders.glm)

We should expect some of the trends we see in the Residuals vs Fitted values plot; with a binary response, the residuals necessarily fall into bands, one for each outcome.

Let’s look at the summary.

summary(spiders.glm)
## 
## Call:
## glm(formula = PA ~ RATIO, family = binomial, data = spiders)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6067  -0.6382   0.2368   0.4332   2.0986  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   3.6061     1.6953   2.127   0.0334 *
## RATIO        -0.2196     0.1005  -2.184   0.0289 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.287  on 18  degrees of freedom
## Residual deviance: 14.221  on 17  degrees of freedom
## AIC: 18.221
## 
## Number of Fisher Scoring iterations: 6
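
Tying this back to the interpretation of coefficients above: exponentiating gives the odds ratios. Here \(e^{-0.2196} \approx 0.80\), so each one-unit increase in RATIO multiplies the odds of lizard presence by about 0.80.

exp(coef(spiders.glm))  # odds ratios for the intercept and RATIO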

Let’s plot this model.

ggplot(data = spiders, aes(x = RATIO, y = PA)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"))

We can test whether the GLM is significantly better than the intercept-only (null) model by using the anova function.

anova(spiders.glm, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: PA
## 
## Terms added sequentially (first to last)
## 
## 
##       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     18     26.287              
## RATIO  1   12.066        17     14.221 0.0005134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
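
This deviance test compares the drop in deviance to a \(\chi^2\) distribution; the same p-value can be computed by hand from the fitted model:

with(spiders.glm,
     pchisq(null.deviance - deviance,
            df = df.null - df.residual,
            lower.tail = FALSE))

which reproduces the 0.0005134 reported above.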

Note that for GLMs, summary does not return an \(R^2\) value. We can still compute an analogue of \(R^2\): the proportion of deviance explained by our model relative to the null model.

1 - (spiders.glm$deviance/spiders.glm$null.deviance)
## [1] 0.4590197

Poisson regression

  • Good for modeling a response variable that is count data
  • Can also be used as an alternative to contingency tables to explore associations between categorical variables
  • Link function is \(log(\mu)\) (see the sketch below)
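
A minimal sketch of fitting a Poisson regression in R, using simulated count data (all variable names here are made up for illustration):

set.seed(123)

# simulate counts whose log-mean increases with a hypothetical predictor
elevation <- runif(50, 0, 100)
counts <- rpois(50, lambda = exp(0.5 + 0.02 * elevation))

counts_glm <- glm(counts ~ elevation, family = poisson)
summary(counts_glm)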

ANCOVA

Sometimes we have a mix of continuous and categorical predictor variables, with a continuous response variable.

See Figure 15.1 in Logan p. 449.

Example

From Logan p. 457

To investigate the impacts of sexual activity on male fruitfly longevity, Partridge and Farquhar (1981) measured the longevity of male fruitflies with access to either one virgin female (potential mate), eight virgin females, one pregnant female (not a potential mate), eight pregnant females, or no females. The available male fruitflies varied in size, and since size is known to impact longevity, the researchers randomly allocated each individual fruitfly to one of the five treatments and also measured thorax length as a covariate (from Box 12.1 of Quinn and Keough (2002)).

Get the data

fruitfly <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter15/Data/partridge.csv")

head(fruitfly)
##   TREATMENT THORAX LONGEV
## 1     Preg8   0.64     35
## 2     Preg8   0.68     37
## 3     Preg8   0.68     49
## 4     Preg8   0.72     46
## 5     Preg8   0.72     63
## 6     Preg8   0.76     39

Plot these data

By group membership

ggplot(data = fruitfly, aes(x = TREATMENT, y = LONGEV)) +
  geom_boxplot() +
  theme_bw()

By size

ggplot(data = fruitfly, aes(x = THORAX, y = LONGEV)) +
  geom_point() +
  theme_bw()

By both

ggplot(data = fruitfly, aes(x = THORAX, y = LONGEV, colour = TREATMENT)) +
  geom_point() +
  theme_bw()

With linear regressions

ggplot(data = fruitfly, aes(x = THORAX, y = LONGEV)) +
  geom_point(aes(colour = TREATMENT)) +
  stat_smooth(aes(colour = TREATMENT), method = "lm") +
  stat_smooth(method = "lm", size = 2, colour = "black") +
  theme_bw()

Let’s build an ANCOVA model

fruitfly_ancova <- lm(data = fruitfly, LONGEV ~ TREATMENT * THORAX)

Looking at the ANOVA table for this model, the fact that the TREATMENT:THORAX interaction is not significant indicates that the slopes do not differ significantly among treatments.

anova(fruitfly_ancova)
## Analysis of Variance Table
## 
## Response: LONGEV
##                   Df  Sum Sq Mean Sq  F value    Pr(>F)    
## TREATMENT          4 11939.3  2984.8  26.1983 1.896e-15 ***
## THORAX             1 13168.9 13168.9 115.5855 < 2.2e-16 ***
## TREATMENT:THORAX   4    42.5    10.6   0.0933    0.9844    
## Residuals        115 13102.1   113.9                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Look at the model diagnostics

plot(fruitfly_ancova)

Note that there is a wedge-shaped pattern to the residuals, suggestive of non-homogeneity of variance. We can use a log10 transformation of the response to deal with this problem. Since the interaction was not significant, we also drop it from this refit.

fruitfly_ancova_2 <- lm(data = fruitfly, log10(LONGEV) ~ TREATMENT + THORAX)

plot(fruitfly_ancova_2)

summary(fruitfly_ancova_2)
## 
## Call:
## lm(formula = log10(LONGEV) ~ TREATMENT + THORAX, data = fruitfly)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.226736 -0.058445 -0.003469  0.059961  0.170389 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.79095    0.08443   9.368 5.89e-16 ***
## TREATMENTPreg1  0.02260    0.02368   0.954   0.3419    
## TREATMENTPreg8  0.03648    0.02385   1.530   0.1287    
## TREATMENTVirg1 -0.05381    0.02366  -2.275   0.0247 *  
## TREATMENTVirg8 -0.18165    0.02392  -7.592 7.79e-12 ***
## THORAX          1.19385    0.09900  12.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08364 on 119 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.6932 
## F-statistic: 57.02 on 5 and 119 DF,  p-value: < 2.2e-16

Challenge

Look at how the ANCOVA results differ from either ignoring the continuous or categorical predictor.
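
As a starting point, the two reduced models can be fit as follows; comparing their coefficients and fit to the ANCOVA above is left as the exercise:

# categorical predictor only (ignores size)
fruitfly_trt_only <- lm(data = fruitfly, log10(LONGEV) ~ TREATMENT)

# continuous predictor only (ignores treatment)
fruitfly_size_only <- lm(data = fruitfly, log10(LONGEV) ~ THORAX)

summary(fruitfly_trt_only)
summary(fruitfly_size_only)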

Frequency Analysis

\[ \chi^2 = \sum \frac{(o-e)^2}{e} \]

where \(o =\) observed and \(e =\) expected frequencies.
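
As a quick illustration with made-up frequencies, the statistic is a one-line computation in R:

observed <- c(20, 30, 50)   # hypothetical observed counts
expected <- c(25, 25, 50)   # hypothetical expected counts

chisq_stat <- sum((observed - expected)^2 / expected)
chisq_stat
pchisq(chisq_stat, df = length(observed) - 1, lower.tail = FALSE)  # p-value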

To examine if some set of data is distributed in a specific way, or has a frequency based on a known distribution, we can use the Kolmogorov-Smirnov Test. You can also use the KS Test to see if two sets of data are distributed in the same way.
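
Both versions are available in R via ks.test() (data simulated here for illustration):

set.seed(1)
x <- rnorm(50)
y <- runif(50)

ks.test(x, "pnorm")  # one sample: is x consistent with a standard normal?
ks.test(x, y)        # two samples: do x and y share the same distribution?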

We can also examine whether two or more categorical variables are associated with each other. This is essentially an extension of the \(\chi^2\) analysis. The null hypothesis is that the variables are not associated.

The G-test, introduced below, is an alternative to the goodness-of-fit test and contingency table approach.

From Logan p. 472

The G-test is based on a log likelihood-ratio test. A log likelihood ratio is the log of the ratio of the maximum likelihoods under the alternative and null hypotheses. More simply, a log likelihood-ratio test examines how likely the alternative hypothesis (representing an effect) is, compared to how likely the null hypothesis (no effect) is, given the collected data.

The \(G^2\) statistic is

\[ G^2 = 2 \sum_i o_i log(\frac{o_i}{e_i}) \]

where \(o =\) observed and \(e =\) expected frequencies.
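
Like the \(\chi^2\) statistic, \(G^2\) is a one-line computation (again using hypothetical frequencies):

observed <- c(20, 30, 50)   # hypothetical observed counts
expected <- c(25, 25, 50)   # hypothetical expected counts

G2 <- 2 * sum(observed * log(observed / expected))
G2
pchisq(G2, df = length(observed) - 1, lower.tail = FALSE)  # p-value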

Example - Two-way contingency table

From Logan pp. 478 - 480

In order to investigate the mortality of coolibah (Eucalyptus coolibah) trees across riparian dunes, Roberts (1993) counted the number of quadrats in which dead trees were present and the number in which they were absent in three positions (top, middle and bottom) along transects from the lakeshore up to the top of dunes. In this case, the classification of quadrats according to the presence/absence of dead coolibah trees will be interpreted as a response variable and the position along transect as a predictor variable (see Box 14.3 of Quinn and Keough (2002)).

Get the data

trees <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter16/Data/roberts.csv")

head(trees)
##   QUADRAT POSITION    DEAD
## 1       1   Bottom    With
## 2       2   Bottom    With
## 3       3   Bottom Without
## 4       4   Bottom Without
## 5       5      Top Without
## 6       6   Bottom    With

Make a contingency table

trees_xtab <- table(trees$POSITION, trees$DEAD)

Run a \(\chi^2\) test

(trees_chisq <- chisq.test(trees_xtab, corr = F))
## Warning in chisq.test(trees_xtab, corr = F): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  trees_xtab
## X-squared = 13.661, df = 2, p-value = 0.00108

We can also run a G-test using likelihood.test in the package Deducer (but note that this package has a lot of dependencies, as it is part of the JGR project).

library(Deducer)

likelihood.test(trees_xtab, conservative = TRUE)
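
If you would rather avoid Deducer’s dependencies, an uncorrected \(G^2\) can be computed directly from the expected frequencies that chisq.test() returns (likelihood.test() applies a correction when conservative = TRUE, so the values may differ slightly):

observed <- trees_xtab
expected <- chisq.test(trees_xtab)$expected  # expected counts under independence

# assumes no zero cells in the observed table
G2 <- 2 * sum(observed * log(observed / expected))
G2
pchisq(G2, df = trees_chisq$parameter, lower.tail = FALSE)  # p-value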

From the results of both of these tests, we can reject the null hypothesis of no association between transect position and number of dead trees observed.