Let’s say our data don’t fit the assumptions of the t test. What are our options?
Look at ranks, rather than actual values.
From Q&K p. 47:
Powell and Russel (1984, 1985) investigated differences in bettle consumption between two size classes of eastern horned lizards.
Get the data and look at it
lizard <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter6/Data/beetle.csv")
head(lizard)
## SIZE BEETLES
## 1 small 256
## 2 small 209
## 3 small 0
## 4 small 0
## 5 small 0
## 6 small 44
summary(lizard)
## SIZE BEETLES
## large:21 Min. : 0.0
## small:24 1st Qu.: 0.0
## Median : 34.0
## Mean :112.7
## 3rd Qu.:179.0
## Max. :843.0
Have a look at the box plots for these data.
library(ggplot2)
ggplot() +
geom_boxplot(data = lizard, aes(x = SIZE, y = BEETLES) ) +
theme_bw()
Let’s also look at the histograms of these data
ggplot() +
geom_histogram(data = lizard, aes(x = BEETLES, fill = SIZE) ) +
facet_grid( . ~ SIZE ) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What do you observe here that seems like a strong violation to the assumptions of a t test?
## Run t-test
t.test(data = lizard, BEETLES ~ SIZE)
##
## Welch Two Sample t-test
##
## data: BEETLES by SIZE
## t = 2.1907, df = 27.015, p-value = 0.03728
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.862606 209.577870
## sample estimates:
## mean in group large mean in group small
## 170.42857 62.20833
## Record the original t-stat
t_orig <- t.test(data = lizard, BEETLES ~ SIZE)$statistic
## Initiate vector of new t-stat values
t_rand <- c()
## Set the number of reps (shuffles) we want to use
reps <- 1000
## Begin for loop to randomize data
for ( int in 1:reps){
## Shuffle the data
lizard_shuffled <- data.frame(SIZE = sample(lizard$SIZE), BEETLES = lizard$BEETLES)
## Run the t-test on the new data, and save the t-stat to the t_rand vector
t_rand <- c( t_rand,
t.test(data = lizard_shuffled, BEETLES ~ SIZE)$statistic )
}
ggplot() +
geom_histogram(data = NULL, aes(x = t_rand)) +
geom_vline(xintercept = t_orig, colour = "red") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To calculate how likely our original t statistic is, look at how many values are more extreme. Remember to check both tails.
t_rand_extreme <- sum(abs(t_rand) > t_orig)
(p_t_orig <- t_rand_extreme/ reps)
## [1] 0.02
If you perform a singly hypothesis test, using \(\alpha = 0.05\), what is the probability that you reject the null hypothesis, even though the null is correct?
If you perform 20 hypothesis tests, using \(\alpha = 0.05\), what is the probability that in at least one of those cases, you will reject the null hypothesis, even though it is correct?
Make some random data - 10 sets of 10 observations from the standard normal. We know that all of these sets are from the exact same population!
my_data <- list()
for (i in 1:20) {
my_data[[i]] <- rnorm(10) #Note the double brackets for a list
}
Run a t test for all possible pair-wise comparisons.
p_vals <- matrix(ncol = 20, nrow = 20)
for (i in 1:19) {
for (j in (i + 1):20) {
p_vals[i, j] <- t.test(my_data[[i]], my_data[[j]])$p.value
}
}
p_vals
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] NA 0.5338085 0.7030163 0.5510192 0.9787981 0.22618690 0.2579768
## [2,] NA NA 0.8021567 0.2582456 0.6529965 0.56563662 0.5825617
## [3,] NA NA NA 0.3533273 0.7957129 0.40235338 0.4290081
## [4,] NA NA NA NA 0.6358835 0.09659346 0.1188466
## [5,] NA NA NA NA NA 0.36242519 0.3783536
## [6,] NA NA NA NA NA NA 0.9880533
## [7,] NA NA NA NA NA NA NA
## [8,] NA NA NA NA NA NA NA
## [9,] NA NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA NA
## [11,] NA NA NA NA NA NA NA
## [12,] NA NA NA NA NA NA NA
## [13,] NA NA NA NA NA NA NA
## [14,] NA NA NA NA NA NA NA
## [15,] NA NA NA NA NA NA NA
## [16,] NA NA NA NA NA NA NA
## [17,] NA NA NA NA NA NA NA
## [18,] NA NA NA NA NA NA NA
## [19,] NA NA NA NA NA NA NA
## [20,] NA NA NA NA NA NA NA
## [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0.9542734 0.25335281 0.41001993 0.20340441 0.6712270 0.5821931
## [2,] 0.6081847 0.11709431 0.78128984 0.08886734 0.4276495 0.3127823
## [3,] 0.7732211 0.15961664 0.61556064 0.12314000 0.5185392 0.4063846
## [4,] 0.5535589 0.53373922 0.20876121 0.47276877 0.9575347 0.9506767
## [5,] 0.9846937 0.34656504 0.51830360 0.30644830 0.6941049 0.6373671
## [6,] 0.2909377 0.04456389 0.81575257 0.03113668 0.2465215 0.1425353
## [7,] 0.3166150 0.05512300 0.81513915 0.04091228 0.2564444 0.1593645
## [8,] NA 0.26804846 0.47031870 0.22278175 0.6571985 0.5758693
## [9,] NA NA 0.09892786 0.95672498 0.7030069 0.6294589
## [10,] NA NA NA 0.07828489 0.3463640 0.2483842
## [11,] NA NA NA NA 0.6662242 0.5798501
## [12,] NA NA NA NA NA 0.9959046
## [13,] NA NA NA NA NA NA
## [14,] NA NA NA NA NA NA
## [15,] NA NA NA NA NA NA
## [16,] NA NA NA NA NA NA
## [17,] NA NA NA NA NA NA
## [18,] NA NA NA NA NA NA
## [19,] NA NA NA NA NA NA
## [20,] NA NA NA NA NA NA
## [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 0.9775804 0.9636612 0.06961829 0.8370927 0.5324950 0.7379107
## [2,] 0.6395795 0.5698005 0.24403364 0.5072546 0.2661383 0.8245385
## [3,] 0.7707658 0.7419576 0.14982830 0.6303185 0.3553082 0.9988697
## [4,] 0.6977623 0.5314059 0.02706144 0.8071017 0.9293482 0.4101296
## [5,] 0.9643123 0.9940383 0.16683488 0.8469062 0.6058162 0.8097398
## [6,] 0.3696110 0.2507325 0.55352983 0.2666102 0.1097263 0.4545389
## [7,] 0.3827074 0.2813331 0.59158923 0.2830299 0.1283341 0.4731324
## [8,] 0.9459608 0.9877318 0.10647601 0.8126083 0.5323880 0.7954397
## [9,] 0.4020058 0.2459321 0.01370775 0.4643806 0.6189187 0.1977139
## [10,] 0.5130753 0.4373614 0.44808185 0.4017407 0.2130721 0.6466409
## [11,] 0.3633749 0.1980316 0.00859410 0.4204226 0.5642642 0.1623194
## [12,] 0.7343552 0.6542022 0.12602870 0.8208330 0.9926998 0.5384207
## [13,] 0.6902781 0.5636728 0.05306805 0.7884814 0.9854269 0.4443476
## [14,] NA 0.9529705 0.18184461 0.8898486 0.6636165 0.7843396
## [15,] NA NA 0.08102365 0.8132281 0.5146279 0.7712518
## [16,] NA NA NA 0.1168114 0.0356328 0.1977529
## [17,] NA NA NA NA 0.7634905 0.6544567
## [18,] NA NA NA NA NA 0.4016445
## [19,] NA NA NA NA NA NA
## [20,] NA NA NA NA NA NA
## [,20]
## [1,] 0.70945264
## [2,] 0.38574756
## [3,] 0.50023110
## [4,] 0.89086820
## [5,] 0.74802449
## [6,] 0.17575577
## [7,] 0.19598985
## [8,] 0.69414400
## [9,] 0.49553230
## [10,] 0.30357144
## [11,] 0.44425659
## [12,] 0.88460128
## [13,] 0.86136732
## [14,] 0.79942619
## [15,] 0.68635639
## [16,] 0.06386938
## [17,] 0.91102597
## [18,] 0.83686948
## [19,] 0.53831014
## [20,] NA
How many are false possitives?
sum(p_vals < 0.05, na.rm = TRUE)
## [1] 7
From Logan, p. 151:
A statistical model is an expression that attempts to explain patterns in the observed values of a response variable by relating the response variable to a set of predictor variables and parameters.
A generic model looks like:
response variable = model + error
\[ y = bx + a \]
where \(b\) is the slope and \(a\) is the y-intercept.
Written Notes - pages 1-3
Calcualte covariation between sepal length and sepal width, using the iris
dataset.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
library(ggplot2)
ggplot(data = iris) +
geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
theme_bw()
covar <- sum((iris$Sepal.Length - mean(iris$Sepal.Length)) * (iris$Sepal.Width - mean(iris$Sepal.Width)))/ (nrow(iris) - 1)
## Confirm
var(x = iris$Sepal.Length, y = iris$Sepal.Width)
## [1] -0.042434
Written Notes - Page 4
Calculate the correlation between sepal length and sepal width, and calculate whether \(r\) is significantly different from 0 or not.
cor_iris <- covar / (sd(iris$Sepal.Length)*sd(iris$Sepal.Width))
se_cor_iris <- sqrt((1 - cor_iris^2)/(nrow(iris) - 2))
t_cor_iris <- cor_iris/se_cor_iris
## Calculate the probability of getting our t value, or one more extreme
pt(q = t_cor_iris, df = (nrow(iris)-2))*2
## [1] 0.1518983
## Use `cor` to find correlation value
with(data = iris, cor(x = Sepal.Length, y = Sepal.Width, method = "pearson"))
## [1] -0.1175698
## Test the correlation
with(data = iris, cor.test(x = Sepal.Length, y = Sepal.Width, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
For Pearson’s correlation coefficient to be appropriate, the data should have: