t.test
In this course, the answers to these questions are based primarily on frequentist statistics.
Let’s say we have a sample of observations, and we want to know whether it comes from a population with a known value for \(\mu\) (i.e., we know the population mean). We can see how (un)likely it is that the sample estimates come from this particular population by looking at where these values fall in a t distribution. That is, we calculate the t statistic:
\[ t_s = \frac{\bar{y} - \mu}{s_{\bar{y}}} \]
We use the t distribution because the standard error, \(s_{\bar{y}}\), is itself an estimate based on the data in our sample.
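As a minimal sketch of this calculation, the R code below computes \(t_s\) by hand for a small sample and compares it with the statistic from R's built-in `t.test`. The observations in `y` and the population mean `mu` are made-up values for illustration, not data from the course.

```r
# Hypothetical sample and known population mean (illustrative values only)
y  <- c(9.1, 10.4, 11.2, 8.7, 10.9, 12.3, 9.8, 11.5)
mu <- 10

# Estimated standard error of the mean: s / sqrt(n)
se_y <- sd(y) / sqrt(length(y))

# One-sample t statistic, following the formula above
t_s <- (mean(y) - mu) / se_y

# The same statistic from the built-in one-sample t test
t_builtin <- unname(t.test(y, mu = mu)$statistic)

all.equal(t_s, t_builtin)
```

The two values should agree, since `t.test` with the `mu` argument performs the same one-sample calculation.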
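The general consensus is a threshold of \(p = 0.05\). This threshold is the significance level, \(\alpha\), which is the accepted Type I error rate: the probability of rejecting \(H_0\) when it is true.
The complement of the Type II error rate (\(\beta\)) is power, \(1 - \beta\).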
\[ \text{power} = (1-\beta) \propto \frac{ES \, \alpha \sqrt{n}}{\sigma} \]
where \(ES\) is effect size, \(n\) is the sample size, \(\alpha\) is the accepted Type I error rate, and \(\sigma\) is the standard deviation.
To increase power, increase the sample size.
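The effect of sample size on power can be sketched with `power.t.test` from R's `stats` package. The effect size (`delta = 1`), standard deviation (`sd = 2`), and the sample sizes below are arbitrary assumptions chosen for illustration. Note that `power.t.test` defaults to a two-sample, two-sided test, and `n` is the number of observations per group.

```r
# Candidate sample sizes per group (illustrative values)
n_values <- c(5, 10, 20, 40, 80)

# Power for each sample size, holding effect size, sd, and alpha fixed
power_by_n <- sapply(n_values, function(n) {
  power.t.test(n = n, delta = 1, sd = 2, sig.level = 0.05)$power
})

data.frame(n = n_values, power = round(power_by_n, 3))
```

With everything else held constant, power rises as `n` increases, which is the pattern the proportionality above describes.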
See p. 33-34 in Q&K. Fisher’s approach below:
What are some of the potential problems with this approach?
When we use a sampling distribution for our test statistic (e.g., the t distribution), we are asking “what is the probability of observing our data, or something more extreme, in the long run, if \(H_0\) is true.” Mathematically, this can be written as:
\[ P(\text{data} \mid H_0) \]
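One way to make the "long run" idea concrete is to simulate it. The sketch below repeatedly draws samples from a population in which \(H_0\) is true, calculates a t statistic for each, and asks how often the simulated statistic is at least as extreme as an observed one. The sample size, population parameters, and the observed value `t_obs` are all hypothetical.

```r
set.seed(42)  # for reproducibility

n     <- 20    # hypothetical sample size
mu0   <- 10    # population mean under H0 (assumed)
t_obs <- 2.5   # hypothetical observed t statistic

# Long-run behaviour of the t statistic when H0 is true
t_null <- replicate(10000, {
  y <- rnorm(n, mean = mu0, sd = 2)
  (mean(y) - mu0) / (sd(y) / sqrt(n))
})

# Proportion of simulated statistics at least as extreme as t_obs (two-tailed)
p_sim <- mean(abs(t_null) >= abs(t_obs))

# The same probability taken from the t distribution with n - 1 df
p_theory <- 2 * pt(abs(t_obs), df = n - 1, lower.tail = FALSE)

c(simulated = p_sim, theoretical = p_theory)
```

The simulated proportion should be close to the value from the t distribution, which is why the sampling distribution can stand in for the simulation.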
Below is the generic form of the t statistic:
\[ t_s = \frac{St - \theta}{S_{St}} \]
where \(St\) is the value of some statistic (e.g., the mean) from our sample, \(\theta\) is the population value for that statistic, and \(S_{St}\) is the estimated standard error of the statistic \(St\) (based on our sample).
How can we use this formula to test whether two samples are drawn from the same population?
Imagine the case where we have two different samples, and for each we’re testing whether the sample mean differs from its population mean. We then have:
\[ t_1 = \frac{\bar{y}_1-\mu_1}{s_{\bar{y}_1}} \]
and
\[ t_2 = \frac{\bar{y}_2-\mu_2}{s_{\bar{y}_2}} \]
If the two samples are drawn from the same population, then \(\mu_1 = \mu_2\), or \(\mu_1 - \mu_2 = 0\).
We can then write our t statistic as:
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2)}{s_{\bar{y}_1 - \bar{y}_2}} \]
which simplifies to:
\[ t = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}} \]
where \(s_{\bar{y}_1 - \bar{y}_2}\) is the standard error of the difference between the means and is equal to:
\[ s_{\bar{y}_1 - \bar{y}_2} = \sqrt{ \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 -2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) } \]
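A useful sanity check on this formula is the equal-sample-size case. When \(n_1 = n_2 = n\), the pooled variance reduces to \((s_1^2 + s_2^2)/2\) and the bracketed term to \(2/n\), so the standard error becomes \(\sqrt{(s_1^2 + s_2^2)/n}\). The sketch below checks this numerically. The helper `pooled_se` and the values of `s1`, `s2`, and `n` are hypothetical, introduced here only for illustration.

```r
# Standard error of the difference between two means (pooled variance),
# written directly from the formula above
pooled_se <- function(s1, s2, n1, n2) {
  pooled_var <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
  sqrt(pooled_var * (1 / n1 + 1 / n2))
}

# Hypothetical standard deviations and a common sample size
s1 <- 2.0
s2 <- 2.5
n  <- 30

se_general <- pooled_se(s1, s2, n, n)      # general formula
se_equal_n <- sqrt((s1^2 + s2^2) / n)      # simplified form for n1 = n2

all.equal(se_general, se_equal_n)
```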
Ward and Quinn (1988) investigated the differences in fecundity of Lepsiella vinosa in two different intertidal zones (mussel zone and littorinid zone).
Get the data and have a quick look
gastro <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter6/Data/ward.csv")
summary(gastro)
##      ZONE                EGGS
##  Length:79          Min.   : 5.00
##  Class :character   1st Qu.: 8.50
##  Mode  :character   Median :10.00
##                     Mean   :10.11
##                     3rd Qu.:12.00
##                     Max.   :18.00
Make a box plot to help assess differences in variance and deviations from normality.
library(ggplot2)
# Box plot of egg counts by intertidal zone
ggplot() +
  geom_boxplot(data = gastro, aes(x = ZONE, y = EGGS)) +
  theme_bw()
Calculate means and standard deviations of each group separately. We will be using dplyr for this.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gastro_summary <-
  gastro %>%
  group_by(ZONE) %>%
  summarise(Mean = mean(EGGS),
            Var = var(EGGS),
            SD = sd(EGGS),
            SE = sd(EGGS) / sqrt(n()))  # n() is the number of rows in the group
gastro_summary
## # A tibble: 2 × 5
##   ZONE    Mean   Var    SD    SE
##   <chr>  <dbl> <dbl> <dbl> <dbl>
## 1 Littor  8.70  4.10  2.03 0.333
## 2 Mussel 11.4   5.36  2.31 0.357
First, let’s calculate \(s_{\bar{y}_1 - \bar{y}_2}\) using the formula noted above:
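# Sample size of each zone
n_littor <- sum(gastro$ZONE == "Littor")
n_mussel <- sum(gastro$ZONE == "Mussel")
# Standard deviations (row 1 of gastro_summary is Littor, row 2 is Mussel)
sd_littor <- gastro_summary$SD[1]
sd_mussel <- gastro_summary$SD[2]
# Standard error of the difference between the means (pooled variance)
s_1_2 <- sqrt( (((n_littor - 1)*sd_littor^2 + (n_mussel - 1)*sd_mussel^2) /
                  (n_littor + n_mussel - 2)) *
                 ((1/n_littor) + (1/n_mussel)) )
# t statistic for the difference between the two means
mean_littor <- gastro_summary$Mean[1]
mean_mussel <- gastro_summary$Mean[2]
tstat <- (mean_littor - mean_mussel) / s_1_2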
OK, now we have our t statistic. In order to use this to say something about the probability that these two samples come from a population with the same mean, we need to know what degrees of freedom to use to parameterize our t distribution. Here’s the simple formula:
\[ df = (n_1 - 1) + (n_2 - 1) \]
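So now we can say that the df here, and the corresponding two-tailed p-value, are:
df_gastro <- (n_littor - 1) + (n_mussel - 1)
# Two-tailed p-value; abs() with lower.tail = FALSE is valid whatever the sign of tstat
2 * pt(q = abs(tstat), df = df_gastro, lower.tail = FALSE)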
## [1] 7.457222e-07
The resulting probability value is very small, much less than 0.05. In this case, we would reject the null hypothesis that these two samples come from the same population.
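An equivalent way to reach this decision is to compare the t statistic with a critical value from the t distribution. The sketch below assumes \(\alpha = 0.05\) and hard-codes the t statistic (about -5.39) and degrees of freedom (77) for the gastropod data so that it runs on its own. These are rounded values, so treat the result as approximate.

```r
t_obs  <- -5.3899  # t statistic for the gastropod data (rounded)
df_obs <- 77       # (n_1 - 1) + (n_2 - 1)
alpha  <- 0.05     # accepted Type I error rate (assumed)

# Critical value for a two-tailed test: the upper alpha/2 quantile
t_crit <- qt(p = 1 - alpha / 2, df = df_obs)

# Reject H0 when the observed statistic is more extreme than the critical value
reject_h0 <- abs(t_obs) > t_crit

c(t_crit = t_crit, reject_h0 = reject_h0)
```

Comparing \(|t|\) with the critical value and comparing the p-value with \(\alpha\) always lead to the same decision; the p-value simply carries more information about how extreme the statistic is.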
The t.test function
(gastro_t_test <- t.test(EGGS ~ ZONE, data = gastro, var.equal = TRUE))
##
## Two Sample t-test
##
## data: EGGS by ZONE
## t = -5.3899, df = 77, p-value = 7.457e-07
## alternative hypothesis: true difference in means between group Littor and group Mussel is not equal to 0
## 95 percent confidence interval:
## -3.63511 -1.67377
## sample estimates:
## mean in group Littor mean in group Mussel
## 8.702703 11.357143
gastro_t_test$estimate[1] - gastro_t_test$estimate[2]
## mean in group Littor
## -2.65444
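The 95 percent confidence interval in the `t.test` output can be reconstructed, at least approximately, from the difference in means, the t statistic, and the degrees of freedom. The sketch below hard-codes the rounded values printed above rather than reading them from `gastro_t_test`, so small rounding differences are expected.

```r
diff_means <- -2.65444  # difference between group means (from the output above)
t_obs      <- -5.3899   # t statistic (from the output above)
df_obs     <- 77        # degrees of freedom

# The standard error of the difference follows from t = diff / SE
se_diff <- diff_means / t_obs

# Two-tailed 95% critical value of the t distribution
t_crit <- qt(p = 0.975, df = df_obs)

# Confidence interval: difference plus or minus critical value times SE
ci <- diff_means + c(-1, 1) * t_crit * se_diff
ci
```

The interval excludes zero, which is consistent with rejecting the null hypothesis at \(\alpha = 0.05\).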