From Logan, p. 151:
A statistical model is an expression that attempts to explain patterns in the observed values of a response variable by relating the response variable to a set of predictor variables and parameters.
A generic model looks like:
response variable = model + error
For a simple linear relationship, the model portion is the familiar equation of a line:
\[ y = bx + a \]
where \(b\) is the slope and \(a\) is the y-intercept.
To make this a statistical model we could use the following equation:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Considering all data points at the same time, we can write this as:
\[ Y \sim \beta_0 + \beta_1 X + \epsilon \]
where:
\[ \epsilon \sim N(0,\sigma^2) \]
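To make this concrete, here is a minimal sketch (not from Logan) that simulates data from exactly this kind of model; the intercept, slope, and error standard deviation are arbitrary values chosen only for illustration.
## Simulate data from y_i = beta_0 + beta_1 * x_i + epsilon_i,
## with epsilon ~ N(0, sigma^2); parameter values are arbitrary
set.seed(42)
n <- 50
x <- runif(n, min = 0, max = 10)
beta_0 <- 2
beta_1 <- 0.5
sigma <- 1
epsilon <- rnorm(n, mean = 0, sd = sigma)
y <- beta_0 + beta_1 * x + epsilon
plot(x, y)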
Let’s take a step back and ignore the potential cause-effect relationships implied by the above linear model (i.e., the model that suggests a change in \(x\) results in a change in \(y\)). Perhaps changes in \(x\) and \(y\) are only correlated with each other, or said another way, they co-vary with each other.
We can measure the strength of this relationship as co-variation and/or correlation.
Co-variance is based on the sum of cross-products, i.e., the summed products of each observation’s deviations from \(\bar{x}\) and \(\bar{y}\), scaled by \(n - 1\):
\[ \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1} \]
Using the iris data set, calculate the co-variation between sepal length and sepal width.
data(iris)
library(ggplot2)
ggplot(data = iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
theme_bw()
covar <- sum((iris$Sepal.Length - mean(iris$Sepal.Length)) *
             (iris$Sepal.Width - mean(iris$Sepal.Width))) / (nrow(iris) - 1)
## Confirm with a built-in function: var() with two numeric vectors returns their covariance
var(x = iris$Sepal.Length, y = iris$Sepal.Width)
## [1] -0.042434
We can standardize co-variation to take on a value between -1 and 1, yielding Pearson’s correlation coefficient, also called the product-moment correlation coefficient.
\[ \rho_{xy} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum{(x_i-\bar{x})^2} \sum{(y_i-\bar{y})^2}}} \]
If we use the sample means \(\bar{x}\) and \(\bar{y}\) (our estimates of the population means), then \(\rho\) is estimated by \(r\), the sample correlation coefficient. \(r\) has an underlying distribution, just like other sample statistics (e.g., \(t\) or \(F\)); it is essentially a \(t\) distribution with a standard error of:
\[ s_r = \sqrt{\frac{(1-r^2)}{(n-2)}} \]
Then the \(t\) value is \(t = \frac{r}{s_r}\).
Using the iris data set, calculate the correlation between sepal length and sepal width, and test whether \(r\) is significantly different from 0.
cor_iris <- covar / (sd(iris$Sepal.Length)*sd(iris$Sepal.Width))
se_cor_iris <- sqrt((1 - cor_iris^2)/(nrow(iris) - 2))
t_cor_iris <- cor_iris/se_cor_iris
## Two-tailed p-value: probability of a t value at least this extreme
pt(q = abs(t_cor_iris), df = nrow(iris) - 2, lower.tail = FALSE) * 2
## [1] 0.1518983
## Use `cor` to find correlation value
with(data = iris, cor(x = Sepal.Length, y = Sepal.Width, method = "pearson"))
## [1] -0.1175698
## Test the correlation
with(data = iris, cor.test(x = Sepal.Length, y = Sepal.Width, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
For Pearson’s correlation coefficient to be appropriate, the data should have a linear relationship between \(x\) and \(y\), and (for hypothesis testing) an approximately bivariate normal distribution, i.e., both variables roughly normally distributed.
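As a quick, informal check of the normality assumption for the iris example (a sketch, not part of Logan’s text), we can look at normal quantile-quantile plots of each variable; linearity can be judged from the scatterplot made earlier.
## Rough check of normality for each variable: points should fall near the line
qqnorm(iris$Sepal.Length); qqline(iris$Sepal.Length)
qqnorm(iris$Sepal.Width); qqline(iris$Sepal.Width)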
Now, let’s assume that any change in y is due to a change in x.
Example: the effects of starvation and humidity on water loss in the confused flour beetle. Here, we look at the relationship between humidity and weight loss.
flr_beetle <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter8/Data/nelson.csv")
flr_beetle
## HUMIDITY WEIGHTLOSS
## 1 0.0 8.98
## 2 12.0 8.14
## 3 29.5 6.67
## 4 43.0 6.08
## 5 53.0 5.90
## 6 62.5 5.83
## 7 75.5 4.68
## 8 85.0 4.20
## 9 93.0 3.72
Plot these data
library(ggplot2)
ggplot() +
geom_point(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
theme_bw()
Run a linear regression
flr_beetle_lm <- lm(data = flr_beetle, WEIGHTLOSS ~ HUMIDITY)
summary(flr_beetle_lm)
##
## Call:
## lm(formula = WEIGHTLOSS ~ HUMIDITY, data = flr_beetle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46397 -0.03437 0.01675 0.07464 0.45236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.704027 0.191565 45.44 6.54e-10 ***
## HUMIDITY -0.053222 0.003256 -16.35 7.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2967 on 7 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9708
## F-statistic: 267.2 on 1 and 7 DF, p-value: 7.816e-07
Plot these data, with the lm fit
ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
You can access a video going through this section here - https://youtu.be/2tI5YFDajp8
We have jumped into linear regression by going through the example above relating flour beetle weight loss to humidity, but we have not gone into detail about how “the best fit line” is actually calculated.
We are not going to go into the mathematical details of line fitting here (you can get more details looking at Box 5.5 on pp. 83-84 in Quinn and Keough 2002), but we will go through an intuitive explanation.
Recall that our goal is to fit a line through the cloud of data points such that the distances between the points and the line are minimized. These distances are what we call residuals (or errors), and they are symbolized with \(\epsilon\) or \(e\).
Looking at the plot made with the code below, the vertical black lines represent the residuals.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Fitted (predicted) values from the regression
flr_beetle$y_hat <- predict(flr_beetle_lm, newdata = flr_beetle)
## End points of each residual segment: the observed and fitted value for each row
flr_beetle$y_min <- apply(select(flr_beetle, WEIGHTLOSS, y_hat), MARGIN = 1, FUN = min)
flr_beetle$y_max <- apply(select(flr_beetle, WEIGHTLOSS, y_hat), MARGIN = 1, FUN = max)
ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
geom_linerange(aes(ymin = y_min, ymax = y_max )) +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Again, our goal is to fit a line that minimizes these residuals. In a mathematical sense, what we want to do is find estimates of \(\beta_0\) and \(\beta_1\), i.e., our slope and intercept values, to use in the equation of this line, \(\hat{y} = b_0 + b_1 x\), such that we minimize the value \(\sum(y_i - \hat{y})^2\).
This last equation, \(\sum(y_i - \hat{y})^2\), goes by several names, including the sum of squared residuals, sum of squared errors, and sum of squared deviations. For short, I’ll often use \(SS_{res}\).
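For the flour beetle model fit above, we can pull \(SS_{res}\) directly from the fitted lm object (a quick sketch; deviance() for an lm fit is just another way of asking for the residual sum of squares).
## Sum of squared residuals for the fitted beetle model
sum(residuals(flr_beetle_lm)^2)
## The same quantity, via the deviance of the fitted model
deviance(flr_beetle_lm)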
It is possible to minimize \(SS_{res}\) by taking a guess at \(b_0\) and \(b_1\), calculating \(SS_{res}\), taking another guess at the parameters, and accepting the new values if \(SS_{res}\) is smaller. Keep doing this until you cannot reduce \(SS_{res}\) any further. But this is a pretty inefficient approach. Instead, using a few principles of calculus, we can solve directly for the values of \(b_0\) and \(b_1\) that minimize \(SS_{res}\).
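For simple linear regression those solutions have a closed form: \(b_1 = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}\) and \(b_0 = \bar{y} - b_1\bar{x}\). As a sketch, we can calculate them by hand for the beetle data and confirm that they match the lm() fit.
## Closed-form least-squares estimates for the beetle regression
b1 <- with(flr_beetle,
           sum((HUMIDITY - mean(HUMIDITY)) * (WEIGHTLOSS - mean(WEIGHTLOSS))) /
           sum((HUMIDITY - mean(HUMIDITY))^2))
b0 <- with(flr_beetle, mean(WEIGHTLOSS) - b1 * mean(HUMIDITY))
c(b0, b1)
## Compare with the coefficients estimated by lm()
coef(flr_beetle_lm)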
In order to determine if your linear regression model is meaningful (or significant) you must compare the variance explained by your model versus the variance NOT explained by your model (conveniently referred to as the variance unexplained).
Your residuals represent the variation that is not explained!
Note: there is a typo in this figure in panel (c). Instead of “Explained variability”, the arrow tag should be “Unexplained variability”.
In this figure, we see in panel (d) that we compare the variation explained to the variation unexplained, using an F-ratio test to see whether the amounts of variation ascribed to each differ. If the explained variation is large relative to the unexplained variation, this ratio will be large. Larger \(F\) ratios are more extreme when considered against an F distribution. Thus, an extreme \(F\) ratio implies that the two variances are not equal, and specifically that the amount of variance explained is significantly greater than the amount unexplained.
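We can see this partitioning for the beetle regression with anova(), and, as a sketch, rebuild the F ratio by hand from the explained and residual sums of squares; it should match the F-statistic reported in the model summary above (267.2 on 1 and 7 df).
## ANOVA table: variation explained by HUMIDITY vs. residual (unexplained) variation
anova(flr_beetle_lm)
## The same F ratio by hand: MS_regression / MS_residual
SS_total <- with(flr_beetle, sum((WEIGHTLOSS - mean(WEIGHTLOSS))^2))
SS_res <- sum(residuals(flr_beetle_lm)^2)
SS_reg <- SS_total - SS_res
(SS_reg / 1) / (SS_res / (nrow(flr_beetle) - 2))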
Regression coefficients are statistics and thus we can determine the standard error of these statistics. From there, we can use these values and the t-distribution to determine confidence intervals. (Note that we can also see if these coefficients are significantly different from 0 using a Wald Test (similar to the \(t\) test).)
The confidence interval for \(b_1\) is:
\[ b_1 \pm t_{0.05, n-2}s_{b_{1}} \]
\[ s_{b_{1}} = \sqrt{ \frac{MS_{Residual}}{\sum^n_{i=1}(x_i-\bar{x})^2} } \]
\[ s_{b_{0}} = \sqrt{ MS_{Residual} [\frac{1}{n} + \frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]
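As a sketch, we can compute \(s_{b_1}\) for the beetle slope from \(MS_{Residual}\), build the 95% confidence interval by hand, and compare it with R’s confint().
## Standard error of the slope, built from the residual mean square
n <- nrow(flr_beetle)
MS_res <- sum(residuals(flr_beetle_lm)^2) / (n - 2)
SS_x <- with(flr_beetle, sum((HUMIDITY - mean(HUMIDITY))^2))
s_b1 <- sqrt(MS_res / SS_x)
## 95% confidence interval for the slope
b1 <- coef(flr_beetle_lm)["HUMIDITY"]
b1 + c(-1, 1) * qt(0.975, df = n - 2) * s_b1
## Compare with R's built-in confidence intervals
confint(flr_beetle_lm)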
From Quinn and Keough, p. 87:
The 95% confidence band is a biconcave band that will contain the true population regression line 95% of the time.
\[ s_{\hat{y}} = \sqrt{ MS_{Residual} \left[ \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum^n_{i=1}(x_i-\bar{x})^2} \right] } \]
where \(x_p\) is the value of \(x\) we are “predicting” a \(y\) value for. (Adding 1 inside the brackets gives the larger standard error used for a prediction interval for a single new observation at \(x_p\).)
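In R, predict() will generate these intervals directly. The sketch below overlays both the 95% confidence band for the regression line (the shaded region drawn by stat_smooth()) and the wider 95% prediction band for new observations (dashed lines) on the beetle plot.
## Prediction band at a grid of humidity values spanning the observed range
new_dat <- data.frame(HUMIDITY = seq(0, 93, length.out = 100))
pred_band <- cbind(new_dat,
                   predict(flr_beetle_lm, newdata = new_dat, interval = "prediction"))
ggplot(data = flr_beetle, aes(x = HUMIDITY, y = WEIGHTLOSS)) +
  geom_point() +
  stat_smooth(method = "lm") +
  geom_line(data = pred_band, aes(y = lwr), linetype = "dashed") +
  geom_line(data = pred_band, aes(y = upr), linetype = "dashed") +
  theme_bw()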
If we plot the lm result, we will see a set of diagnostic plots:
plot(flr_beetle_lm)
Leverage: a measure of how extreme a value in x-space is (i.e., along the x-axis) and how much of an influence each \(x_i\) has on \(\hat{y}_i\). High leverage values indicate that the model is unduly influenced by an outlying value.
Residuals: the differences between the observed \(y_i\) values and the predicted \(\hat{y}_i\) values. Looking at the pattern between residuals and \(\hat{y}_i\) values can yield important information regarding violation of model assumptions (e.g., homogeneity of variance).
Cook’s D: a statistic that offers a measure of the influence of each data point on the fitted model, incorporating both leverage and residuals. Values \(\ge 1\) are considered “suspect”.
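All three of these quantities can also be pulled from the fitted model directly with base R’s diagnostic helpers; for example:
## Leverage (hat values), residuals, and Cook's D for each observation
hatvalues(flr_beetle_lm)
residuals(flr_beetle_lm)
cooks.distance(flr_beetle_lm)
## A combined table of influence diagnostics, with influential observations flagged
influence.measures(flr_beetle_lm)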
More details on linear regression diagnostics can be found in this video - https://youtu.be/pOvZsR5JW2w.