Section Goals


Introduction to Linear Models

Statistical Models

From Logan, p. 151:

A statistical model is an expression that attempts to explain patterns in the observed values of a response variable by relating the response variable to a set of predictor variables and parameters.

A generic model looks like:

response variable = model + error

Simple linear model

\[ y = bx + a \]

where \(b\) is the slope and \(a\) is the y-intercept.

To make this a statistical model we could use the following equation:

\[ y_i = \beta_0 + \beta_1 * x_i + \epsilon_i \]

Considering all data points at the same time, we can write this as:

\[ Y \sim \beta_0 + \beta_1 * X + \epsilon \]

where:

\[ \epsilon \sim N(0,\sigma^2) \]
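
To build some intuition for what this statistical model says, here is a small simulation sketch: each observation is the deterministic part \(\beta_0 + \beta_1 x_i\) plus a normally distributed error. The parameter values below are made up purely for illustration.

## Simulate data from the model y_i = beta_0 + beta_1 * x_i + e_i
## (illustrative parameter values only)
set.seed(42)
beta_0 <- 2    # intercept
beta_1 <- 0.5  # slope
sigma <- 1     # standard deviation of the errors
x <- runif(n = 100, min = 0, max = 10)
y <- beta_0 + beta_1 * x + rnorm(n = 100, mean = 0, sd = sigma)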

Co-variation

Let’s take a step back and ignore the potential cause-effect relationships implied by the above linear model (i.e., the model that suggests a change in \(x\) results in a change in \(y\)). Perhaps changes in \(x\) and \(y\) are only correlated with each other, or said another way, they co-vary with each other.

We can measure the strength of this relationship as co-variation and/or correlation.

Co-variance is calculated from the sum of cross-products, divided by \(n - 1\):

\[ \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1} \]

Example with iris data

Calculate co-variation between sepal length and sepal width, using the iris data set.

data(iris)

library(ggplot2)
ggplot(data = iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  theme_bw()

covar <- sum((iris$Sepal.Length - mean(iris$Sepal.Length)) * (iris$Sepal.Width - mean(iris$Sepal.Width)))/ (nrow(iris) - 1)

## Confirm
var(x = iris$Sepal.Length, y = iris$Sepal.Width)
## [1] -0.042434

Correlation

We can standardize co-variation to take on a value between -1 and 1, yielding a value called the Pearson’s correlation coefficient or the product-moment coefficient.

\[ \rho_{xy} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum{(x_i-\bar{x})^2} \sum{(y_i-\bar{y})^2}}} \]

If we compute this quantity using the sample means \(\bar{x}\) and \(\bar{y}\), then \(\rho\) is estimated by \(r\), called the sample correlation coefficient. Like other sample statistics (e.g., \(t\) or \(F\)), \(r\) has an underlying sampling distribution; we can test it against a \(t\) distribution using a standard error of:

\[ s_r = \sqrt{\frac{(1-r^2)}{(n-2)}} \]

Then the \(t\) value is \(t = \frac{r}{s_r}\).

Example of correlation with iris data

Calculate the correlation between sepal length and sepal width, and calculate whether \(r\) is significantly different from 0 or not.

cor_iris <- covar / (sd(iris$Sepal.Length)*sd(iris$Sepal.Width))

se_cor_iris <- sqrt((1 - cor_iris^2)/(nrow(iris) - 2))

t_cor_iris <- cor_iris/se_cor_iris

## Calculate the probability of getting our t value, or one more extreme
pt(q = t_cor_iris, df = (nrow(iris)-2))*2
## [1] 0.1518983
## Use `cor` to find correlation value
with(data = iris, cor(x = Sepal.Length, y = Sepal.Width, method = "pearson"))
## [1] -0.1175698
## Test the correlation
with(data = iris, cor.test(x = Sepal.Length, y = Sepal.Width, method = "pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

Robust correlation

For Pearson’s correlation coefficient to be appropriate, the data should have:

  1. A linear relationship
  2. A bivariate normal distribution

If these assumptions are not met, we can use a rank-based (robust) alternative (see the example below):

  • Spearman’s Rank Correlation - calculates the correlation on the ranks of the observed values
  • Kendall’s Rank Correlation
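
As a quick illustration with the same iris variables used above, cor.test can compute either rank-based coefficient by changing the method argument (no other changes to the call are needed):

## Rank-based (robust) alternatives to Pearson's correlation
with(data = iris, cor.test(x = Sepal.Length, y = Sepal.Width, method = "spearman"))
with(data = iris, cor.test(x = Sepal.Length, y = Sepal.Width, method = "kendall"))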

Linear regression

Now, let’s assume that any change in y is due to a change in x.

Example of linear regression

Effects of starvation and humidity on water loss in the confused flour beetle. Here, we are looking at the relationship between humidity and weight loss.

flr_beetle <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter8/Data/nelson.csv")
flr_beetle
##   HUMIDITY WEIGHTLOSS
## 1      0.0       8.98
## 2     12.0       8.14
## 3     29.5       6.67
## 4     43.0       6.08
## 5     53.0       5.90
## 6     62.5       5.83
## 7     75.5       4.68
## 8     85.0       4.20
## 9     93.0       3.72

Plot these data

library(ggplot2)
ggplot() +
  geom_point(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
  theme_bw()

Run a linear regression

flr_beetle_lm <- lm(data = flr_beetle, WEIGHTLOSS ~ HUMIDITY)
summary(flr_beetle_lm)
## 
## Call:
## lm(formula = WEIGHTLOSS ~ HUMIDITY, data = flr_beetle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46397 -0.03437  0.01675  0.07464  0.45236 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.704027   0.191565   45.44 6.54e-10 ***
## HUMIDITY    -0.053222   0.003256  -16.35 7.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2967 on 7 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9708 
## F-statistic: 267.2 on 1 and 7 DF,  p-value: 7.816e-07

Plot these data, with lm fit

ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
  geom_point() +
  stat_smooth(method = "lm") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Linear regression assumptions

The big three:

  1. Normality: The \(y_i\) AND error values (\(\epsilon_i\)) are normally distributed. If normality is violated, p-values and confidence intervals may be inaccurate and unreliable.
  2. Homogeneity of variance: The \(y_i\) AND error values (\(\epsilon_i\)) have similar levels of variance across all \(x_i\) values.
  3. Independence: The \(y_i\) AND error values are independent of each other.

Other assumptions:

  • Linearity: The relationship between \(x_i\) and \(y_i\) is linear (only important when using simple linear regression).
  • Fixed \(x\) values: The \(x_i\) values are measured without error. In practice this means the error in the \(x_i\) values is much smaller than that in the \(y_i\) values.

Line fitting and model evaluation

You can access a video going through this section here - https://youtu.be/2tI5YFDajp8

Finding the best-fit line

We have jumped into linear regression by going through the example above relating flour beetle weight loss to humidity, but we have not gone into detail about how “the best fit line” is actually calculated.

We are not going to go into the mathematical details of line fitting here (you can get more details looking at Box 5.5 on pp. 83-84 in Quinn and Keough 2002), but we will go through an intuitive explanation.

Recall that our goal is to fit a line through the cloud of data points such that the distance between the points and the line is minimized. These distances are what we call residuals (or errors), and they are symbolized with \(\epsilon\) or \(e\).

Looking at the plot made with the code below, the vertical black lines represent the residuals.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
flr_beetle$y_hat <- predict(flr_beetle_lm, newdata = flr_beetle)
flr_beetle$y_min <- apply(select(flr_beetle, WEIGHTLOSS, y_hat), MARGIN = 1, FUN = min)
flr_beetle$y_max <- apply(select(flr_beetle, WEIGHTLOSS, y_hat), MARGIN = 1, FUN = max)

ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  geom_linerange(aes(ymin = y_min, ymax = y_max )) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Again, our goal is to fit a line that minimizes these residuals. In a mathematical sense, what we want to do is find estimates of \(\beta_0\) and \(\beta_1\), i.e., our intercept and slope values, to use in the equation of this line, \(\hat{y} = b_0 + b_1 x\), such that we minimize the value \(\sum(y_i - \hat{y})^2\).

This last equation, \(\sum(y_i - \hat{y})^2\), goes by several names, including the sum of squared residuals, sum of squared errors, and sum of squared deviations. For short, I’ll often use \(SS_{res}\).

It is possible to minimize \(SS_{res}\) by taking a guess at \(b_0\) and \(b_1\), calculating \(SS_{res}\), taking another guess at the parameters, and accepting the new values if \(SS_{res}\) is smaller. Keep doing this until you cannot reduce \(SS_{res}\) any further. But this is a pretty inefficient approach. Instead, using a few principles of calculus, we can solve directly for the values of \(b_0\) and \(b_1\) that minimize \(SS_{res}\) (see the sketch below).
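
As a quick check of the idea, here is a minimal sketch of those closed-form least-squares estimates, calculated by hand from the flour beetle data and compared against the coefficients returned by lm():

## Closed-form least-squares estimates, calculated by hand
x <- flr_beetle$HUMIDITY
y <- flr_beetle$WEIGHTLOSS
b_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b_0 <- mean(y) - b_1 * mean(x)                                    # intercept
c(b_0, b_1)

## Compare with the coefficients estimated by lm()
coef(flr_beetle_lm)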

Model evaluation

In order to determine if your linear regression model is meaningful (or significant) you must compare the variance explained by your model versus the variance NOT explained by your model (conveniently referred to as the variance unexplained).

Your residuals represent the variation that is not explained!

Logan - Figure 8.3

Note: there is a typo in this figure in panel (c). Instead of “Explained variability”, the arrow tag should be “Unexplained variability”.

In this figure, we see in panel (d) that we compare the variation explained to the variation unexplained, and use an F-ratio test to see whether the amounts of variation ascribed to each differ. If the explained variation is relatively large compared to the unexplained variation, then this ratio will be large. Larger \(F\) ratios are more extreme when considering an F distribution. Thus, an extreme \(F\) ratio implies that the two variances are not equal, and specifically, that the amount of variance explained is significantly greater than the amount unexplained.
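
We can see this partitioning for the flour beetle regression directly: calling anova() on the fitted lm object reports the explained (HUMIDITY) and unexplained (Residuals) sums of squares, their mean squares, and the resulting F ratio.

## Partition the variation and perform the F-ratio test for the flour beetle model
anova(flr_beetle_lm)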

Standard error of regression coefficients, the regression line, and \(\hat{y}_i\) predictions

Regression coefficients are statistics and thus we can determine the standard error of these statistics. From there, we can use these values and the t-distribution to determine confidence intervals. (Note that we can also see if these coefficients are significantly different from 0 using a Wald Test (similar to the \(t\) test).)

The confidence interval for \(b_1\) is:

\[ b_1 \pm t_{0.05, n-2}s_{b_{1}} \]

\(\beta_1\) standard error

\[ s_{b_{1}} = \sqrt{ \frac{MS_{Residual}}{\sum^n_{i=1}(x_i-\bar{x})^2} } \]

\(\beta_0\) standard error

\[ s_{b_{0}} = \sqrt{ MS_{Residual} [\frac{1}{n} + \frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]
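
As a sketch of how these pieces fit together, we can compute both standard errors by hand for the flour beetle model, using the residual mean square from the fit, and compare them against the Std. Error column reported by summary(flr_beetle_lm):

## Residual mean square (MS_Residual) from the fitted model
ms_res <- sum(residuals(flr_beetle_lm)^2) / df.residual(flr_beetle_lm)

x <- flr_beetle$HUMIDITY
n <- nrow(flr_beetle)
ss_x <- sum((x - mean(x))^2)

## Standard errors of the intercept (b_0) and slope (b_1)
se_b0 <- sqrt(ms_res * (1/n + mean(x)^2 / ss_x))
se_b1 <- sqrt(ms_res / ss_x)
c(se_b0, se_b1)

## Compare with the standard errors reported by summary()
summary(flr_beetle_lm)$coefficients[, "Std. Error"]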

Confidence bands for regression line

From Quinn and Keough, p. 87:

The 95% confidence band is a biconcave band that will contain the true population regression line 95% of the time.

\(\hat{y}_i\) standard error

\[ s_{\hat{y}} = \sqrt{ MS_{Residual} [1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]

where \(x_p\) is the value from \(x\) we are “predicting” a \(y\) value for.
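
In practice we rarely compute these intervals by hand. As a sketch, confint() gives confidence intervals for the regression coefficients, and predict() with the interval argument returns the confidence band for the fitted line or a prediction interval for a new observation. The new humidity value below is arbitrary, chosen just for illustration.

## Confidence intervals for the regression coefficients
confint(flr_beetle_lm, level = 0.95)

## Confidence band for the fitted line
predict(flr_beetle_lm, interval = "confidence")

## Prediction interval for a new x value (HUMIDITY = 50 is arbitrary)
predict(flr_beetle_lm, newdata = data.frame(HUMIDITY = 50), interval = "prediction")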

Linear regression diagnostics

If we plot the lm result, we will see a set of diagnostic plots:

plot(flr_beetle_lm)

More details on linear regression diagnostics can be found in this video - https://youtu.be/pOvZsR5JW2w.