In this course, the answers to these questions are based primarily on frequentist statistics.
Let’s say we have a sample of observations, and we want to know whether they come from some population with a known value for \(\mu\) (i.e., we know the population mean). We can see how (un)likely it is that the sample came from this particular population by looking at where the corresponding test statistic falls in a t distribution. That is, we calculate the t statistic:
\[ t_s = \frac{\bar{y} - \mu}{s_{\bar{y}}} \]
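As a quick illustration, here is a minimal R sketch using a small made-up sample and an assumed population mean (the numbers are hypothetical, purely for demonstration):

# A hypothetical sample and a hypothetical known population mean
y  <- c(4.8, 5.2, 5.9, 4.5, 5.6, 6.1, 5.0, 5.4)
mu <- 5.0

# t statistic: (sample mean - population mean) / standard error of the mean
t_s <- (mean(y) - mu) / (sd(y) / sqrt(length(y)))
t_s

# t.test() carries out the same calculation (and reports a p-value)
t.test(y, mu = mu)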
The general consensus is to use a cutoff of \(p = 0.05\). This cutoff is known as the Type-I error rate, or \(\alpha\).
The complement of the Type-II error rate (\(\beta\)) is power, \(1 - \beta\).
\[ \text{power} \; (1 - \beta) \propto \frac{ES \, \sqrt{n} \, \alpha}{\sigma} \]
The most straightforward way to increase power is to increase the sample size.
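We can see this relationship with power.t.test() from R’s stats package. The effect size, standard deviation, and sample sizes below are made-up values for illustration only:

# Power of a two-sample t test across increasing (per-group) sample sizes,
# for a hypothetical effect size (delta = 1) and standard deviation (sd = 2)
sapply(c(10, 20, 40, 80), function(n) {
  power.t.test(n = n, delta = 1, sd = 2, sig.level = 0.05,
               type = "two.sample")$power
})

Holding everything else constant, power climbs toward 1 as n increases.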
See pp. 33-34 in Q&K. Fisher’s approach is outlined below:
What are some of the potential problems with this approach?
When we use a sampling distribution for our test statistic (e.g., the t distribution), we are asking, “What is the probability of observing our data, or something more extreme, in the long run, if \(H_0\) is true?” Mathematically, this can be written as:
\[ P(data|H_0) \]
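In R, this long-run probability is the area of the t distribution at least as extreme as our observed t statistic. A minimal sketch, using a hypothetical observed t value and sample size:

# Two-tailed p-value for a hypothetical observed t statistic of 2.1
# from a sample of n = 8 (so 7 degrees of freedom)
t_obs <- 2.1
2 * pt(-abs(t_obs), df = 7)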
Below is the generic form of the t statistic:
\[ t_s = \frac{St - \theta}{S_{St}} \]
where \(St\) is the value of some statistic (e.g., the mean) from our sample, \(\theta\) is the population value for that statistic, and \(S_{St}\) is the estimated standard error of the statistic \(St\) (based on our sample).
How can we use this formula to test whether two samples are drawn from the same population?
Imagine the case where we have two different samples, and for each one we test whether its mean differs from its population mean. We then have:
\[ t = \frac{\bar{y_1}-\mu_1}{s_{\bar{y}_1}} \]
and
\[ t = \frac{\bar{y_2}-\mu_2}{s_{\bar{y}_2}} \]
If the two samples are drawn from the same population, then \(\mu_1 = \mu_2\), or \(\mu_1 - \mu_2 = 0\).
We can then write our t stat as:
\[ t = \frac{(\bar{y_1} - \bar{y_2}) - (\mu_1 - \mu_2)}{s_{\bar{y}_1 - \bar{y}_2}} \]
which simplifies to:
\[ t = \frac{\bar{y_1} - \bar{y_2}}{s_{\bar{y}_1 - \bar{y}_2}} \]
where \(s_{\bar{y}_1 - \bar{y}_2}\) is the standard error of the difference between the means and is equal to:
\[ s_{\bar{y}_1 - \bar{y}_2} = \sqrt{ \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 -2} (\frac{1}{n_1} + \frac{1}{n_2}) } \]
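Here is a minimal R sketch of this calculation, using two small made-up samples (the values are hypothetical); the hand-calculated t statistic should match the one returned by t.test() with var.equal = TRUE:

# Two small hypothetical samples
y1 <- c(6.2, 5.8, 7.1, 6.5, 5.9, 6.8)
y2 <- c(5.1, 4.8, 5.6, 5.3, 4.9, 5.7, 5.2)
n1 <- length(y1)
n2 <- length(y2)

# Standard error of the difference between the means (pooled variance)
se_diff <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2) *
                  (1 / n1 + 1 / n2))

# t statistic by hand, and the same value from t.test()
(mean(y1) - mean(y2)) / se_diff
t.test(y1, y2, var.equal = TRUE)$statistic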
Ward and Quinn (1988) investigated the differences in fecundity of Lepsiella vinosa in two different intertidal zones (mussel zone and littorinid zone).
Get the data and have a quick look
gastro <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter6/Data/ward.csv")
summary(gastro)
##     ZONE         EGGS
##  Littor:37   Min.   : 5.00
##  Mussel:42   1st Qu.: 8.50
##              Median :10.00
##              Mean   :10.11
##              3rd Qu.:12.00
##              Max.   :18.00
Make a box plot to help assess differences in variance and deviations from normality.
library(ggplot2)
ggplot() +
  geom_boxplot(data = gastro, aes(x = ZONE, y = EGGS)) +
  theme_bw()
Calculate the mean and variance of each group separately. We will be using dplyr for this.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gastro %>%
  group_by(ZONE) %>%
  summarise(Mean = mean(EGGS), Var = var(EGGS))
## # A tibble: 2 x 3
##   ZONE    Mean   Var
##   <fct>  <dbl> <dbl>
## 1 Littor  8.70  4.10
## 2 Mussel 11.4   5.36
The group variances are similar, so run a pooled-variance (Student’s) two-sample t test.
(gastro_t_test <- t.test(data = gastro, EGGS ~ ZONE, var.equal = TRUE))
##
## Two Sample t-test
##
## data: EGGS by ZONE
## t = -5.3899, df = 77, p-value = 7.457e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.63511 -1.67377
## sample estimates:
## mean in group Littor mean in group Mussel
## 8.702703 11.357143
gastro_t_test$estimate[1] - gastro_t_test$estimate[2]
## mean in group Littor
## -2.65444
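As a check on the formula derived earlier, we can compute the pooled standard error and the t statistic by hand from the gastro data; this sketch should reproduce the t value reported by t.test() above (about -5.39):

# Group subsets of egg counts by zone
y1 <- gastro$EGGS[gastro$ZONE == "Littor"]
y2 <- gastro$EGGS[gastro$ZONE == "Mussel"]
n1 <- length(y1)
n2 <- length(y2)

# Pooled standard error of the difference, then the t statistic by hand
se_diff <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2) *
                  (1 / n1 + 1 / n2))
(mean(y1) - mean(y2)) / se_diff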
Furness and Bryant (1996) measured the metabolic rates of male and female breeding northern fulmars, and tested whether there were any observable differences in these rates.
Get the data and have a look
fulmars <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter6/Data/furness.csv")
fulmars
## SEX METRATE BODYMASS
## 1 Male 2950.0 875
## 2 Female 1956.1 635
## 3 Male 2308.7 765
## 4 Male 2135.6 780
## 5 Male 1945.6 790
## 6 Female 1490.5 635
## 7 Female 1361.3 668
## 8 Female 1086.5 640
## 9 Female 1091.0 645
## 10 Male 1195.5 788
## 11 Female 727.7 635
## 12 Male 843.3 855
## 13 Male 525.8 860
## 14 Male 605.7 1005
summary(fulmars)
##       SEX         METRATE          BODYMASS
##  Female :6   Min.   : 525.8   Min.   : 635.0
##  Male   :8   1st Qu.: 904.1   1st Qu.: 641.2
##              Median :1278.4   Median : 772.5
##              Mean   :1444.5   Mean   : 755.4
##              3rd Qu.:1953.5   3rd Qu.: 838.8
##              Max.   :2950.0   Max.   :1005.0
Make a box plot to help assess differences in variance and deviations from normality.
ggplot() +
  geom_boxplot(data = fulmars, aes(x = SEX, y = METRATE)) +
  theme_bw()
Are the variances the same?
Calculate the mean and variance of each group separately, again using dplyr.
fulmars %>%
  group_by(SEX) %>%
  summarise(Mean = mean(METRATE), Var = var(METRATE))
## # A tibble: 2 x 3
##   SEX        Mean     Var
##   <fct>     <dbl>   <dbl>
## 1 "Female " 1286. 177209.
## 2 "Male "   1564. 799903.
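The variance for males is roughly 4 to 5 times that for females. As an added illustration (not part of the original workflow), one way to compare the two variances formally is an F-ratio test with var.test(); keep in mind this test is itself sensitive to non-normality:

# F test for equality of the two group variances
var.test(METRATE ~ SEX, data = fulmars)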
Because the group variances are clearly unequal, use Welch’s t test, which does not assume equal variances.
(fulmars_t_test <- t.test(data = fulmars, METRATE ~ SEX, var.equal = FALSE))
##
## Welch Two Sample t-test
##
## data: METRATE by SEX
## t = -0.77317, df = 10.468, p-value = 0.4565
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1075.3208 518.8042
## sample estimates:
## mean in group Female mean in group Male
## 1285.517 1563.775
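For reference, Welch’s test replaces the pooled degrees of freedom (\(n_1 + n_2 - 2\)) with the Welch-Satterthwaite approximation. A sketch of that calculation, using the fulmars data loaded above; it should reproduce the df of about 10.47 reported by t.test():

# Group variances and sample sizes for METRATE, split by SEX
v <- tapply(fulmars$METRATE, fulmars$SEX, var)
n <- tapply(fulmars$METRATE, fulmars$SEX, length)

# Welch-Satterthwaite approximate degrees of freedom
sum(v / n)^2 / sum((v / n)^2 / (n - 1))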