Non-parametric tests
- Rank based tests - Mann-Whitney test
- Randomization tests
Multiple testing
- Example of problems with multiple comparisons
Introduction to Linear Models

Non-parametric tests

Let’s say our data don’t fit the assumptions of the t test. What are our options?

Rank based tests - Mann-Whitney test

Look at ranks, rather than actual values.

From Q&K p. 47:

Rank all observations, ignoring groups. Tied observations get the average of their ranks.
Calculate the sum of the ranks for both samples. If \(H_0\) is true, then you should expect similar sums.
Compare the smaller rank sum to the probability distribution of rank sums, based on repeated randomization of observations to groups.
If sample sizes are large, the probability distribution of rank sums approximates a normal distribution and the z statistic can be used.

Randomization tests

Powell and Russel (1984, 1985) investigated differences in bettle consumption between two size classes of eastern horned lizards.

Get the data and look at it

lizard <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter6/Data/beetle.csv")
head(lizard)

##    SIZE BEETLES
## 1 small     256
## 2 small     209
## 3 small       0
## 4 small       0
## 5 small       0
## 6 small      44

summary(lizard)

##     SIZE       BEETLES     
##  large:21   Min.   :  0.0  
##  small:24   1st Qu.:  0.0  
##             Median : 34.0  
##             Mean   :112.7  
##             3rd Qu.:179.0  
##             Max.   :843.0

Have a look at the box plots for these data.

library(ggplot2)

ggplot() +
  geom_boxplot(data = lizard, aes(x = SIZE, y = BEETLES) ) +
  theme_bw()

Let’s also look at the histograms of these data

ggplot() +
  geom_histogram(data = lizard, aes(x = BEETLES, fill = SIZE) ) +
  facet_grid( . ~ SIZE ) +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Challenge

What do you observe here that seems like a strong violation to the assumptions of a t test?

Major Challenge

Calculate the t-statistic for differences in number of beetles consumed by the two size classes of lizards.
Randomize/ shuffle the data.
Calculate the new t-statistic. Repeat this 999 times.
How does the observed t-stat compare to the generated t-stats?

## Run t-test
t.test(data = lizard, BEETLES ~ SIZE)

## 
##  Welch Two Sample t-test
## 
## data:  BEETLES by SIZE
## t = 2.1907, df = 27.015, p-value = 0.03728
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    6.862606 209.577870
## sample estimates:
## mean in group large mean in group small 
##           170.42857            62.20833

## Record the original t-stat
t_orig <- t.test(data = lizard, BEETLES ~ SIZE)$statistic

## Initiate vector of new t-stat values
t_rand <- c()

## Set the number of reps (shuffles) we want to use
reps <- 1000

## Begin for loop to randomize data
for ( int in 1:reps){
  
  ## Shuffle the data
  lizard_shuffled <- data.frame(SIZE = sample(lizard$SIZE), BEETLES = lizard$BEETLES)
  
  ## Run the t-test on the new data, and save the t-stat to the t_rand vector
  t_rand <- c( t_rand,
               t.test(data = lizard_shuffled, BEETLES ~ SIZE)$statistic )
  
}

ggplot() + 
  geom_histogram(data = NULL, aes(x = t_rand)) +
  geom_vline(xintercept = t_orig, colour = "red") +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To calculate how likely our original t statistic is, look at how many values are more extreme. Remember to check both tails.

t_rand_extreme <- sum(abs(t_rand) > t_orig)

(p_t_orig <- t_rand_extreme/ reps)

## [1] 0.02

Multiple testing

If you perform a singly hypothesis test, using \(\alpha = 0.05\), what is the probability that you reject the null hypothesis, even though the null is correct?
If you perform 20 hypothesis tests, using \(\alpha = 0.05\), what is the probability that in at least one of those cases, you will reject the null hypothesis, even though it is correct?

Example of problems with multiple comparisons

Make some random data - 10 sets of 10 observations from the standard normal. We know that all of these sets are from the exact same population!

my_data <- list()
for (i in 1:20) {
    my_data[[i]] <- rnorm(10)  #Note the double brackets for a list
}

Run a t test for all possible pair-wise comparisons.

p_vals <- matrix(ncol = 20, nrow = 20)
for (i in 1:19) {
    for (j in (i + 1):20) {
        p_vals[i, j] <- t.test(my_data[[i]], my_data[[j]])$p.value
    }
}
p_vals

##       [,1]      [,2]      [,3]      [,4]      [,5]       [,6]      [,7]
##  [1,]   NA 0.5338085 0.7030163 0.5510192 0.9787981 0.22618690 0.2579768
##  [2,]   NA        NA 0.8021567 0.2582456 0.6529965 0.56563662 0.5825617
##  [3,]   NA        NA        NA 0.3533273 0.7957129 0.40235338 0.4290081
##  [4,]   NA        NA        NA        NA 0.6358835 0.09659346 0.1188466
##  [5,]   NA        NA        NA        NA        NA 0.36242519 0.3783536
##  [6,]   NA        NA        NA        NA        NA         NA 0.9880533
##  [7,]   NA        NA        NA        NA        NA         NA        NA
##  [8,]   NA        NA        NA        NA        NA         NA        NA
##  [9,]   NA        NA        NA        NA        NA         NA        NA
## [10,]   NA        NA        NA        NA        NA         NA        NA
## [11,]   NA        NA        NA        NA        NA         NA        NA
## [12,]   NA        NA        NA        NA        NA         NA        NA
## [13,]   NA        NA        NA        NA        NA         NA        NA
## [14,]   NA        NA        NA        NA        NA         NA        NA
## [15,]   NA        NA        NA        NA        NA         NA        NA
## [16,]   NA        NA        NA        NA        NA         NA        NA
## [17,]   NA        NA        NA        NA        NA         NA        NA
## [18,]   NA        NA        NA        NA        NA         NA        NA
## [19,]   NA        NA        NA        NA        NA         NA        NA
## [20,]   NA        NA        NA        NA        NA         NA        NA
##            [,8]       [,9]      [,10]      [,11]     [,12]     [,13]
##  [1,] 0.9542734 0.25335281 0.41001993 0.20340441 0.6712270 0.5821931
##  [2,] 0.6081847 0.11709431 0.78128984 0.08886734 0.4276495 0.3127823
##  [3,] 0.7732211 0.15961664 0.61556064 0.12314000 0.5185392 0.4063846
##  [4,] 0.5535589 0.53373922 0.20876121 0.47276877 0.9575347 0.9506767
##  [5,] 0.9846937 0.34656504 0.51830360 0.30644830 0.6941049 0.6373671
##  [6,] 0.2909377 0.04456389 0.81575257 0.03113668 0.2465215 0.1425353
##  [7,] 0.3166150 0.05512300 0.81513915 0.04091228 0.2564444 0.1593645
##  [8,]        NA 0.26804846 0.47031870 0.22278175 0.6571985 0.5758693
##  [9,]        NA         NA 0.09892786 0.95672498 0.7030069 0.6294589
## [10,]        NA         NA         NA 0.07828489 0.3463640 0.2483842
## [11,]        NA         NA         NA         NA 0.6662242 0.5798501
## [12,]        NA         NA         NA         NA        NA 0.9959046
## [13,]        NA         NA         NA         NA        NA        NA
## [14,]        NA         NA         NA         NA        NA        NA
## [15,]        NA         NA         NA         NA        NA        NA
## [16,]        NA         NA         NA         NA        NA        NA
## [17,]        NA         NA         NA         NA        NA        NA
## [18,]        NA         NA         NA         NA        NA        NA
## [19,]        NA         NA         NA         NA        NA        NA
## [20,]        NA         NA         NA         NA        NA        NA
##           [,14]     [,15]      [,16]     [,17]     [,18]     [,19]
##  [1,] 0.9775804 0.9636612 0.06961829 0.8370927 0.5324950 0.7379107
##  [2,] 0.6395795 0.5698005 0.24403364 0.5072546 0.2661383 0.8245385
##  [3,] 0.7707658 0.7419576 0.14982830 0.6303185 0.3553082 0.9988697
##  [4,] 0.6977623 0.5314059 0.02706144 0.8071017 0.9293482 0.4101296
##  [5,] 0.9643123 0.9940383 0.16683488 0.8469062 0.6058162 0.8097398
##  [6,] 0.3696110 0.2507325 0.55352983 0.2666102 0.1097263 0.4545389
##  [7,] 0.3827074 0.2813331 0.59158923 0.2830299 0.1283341 0.4731324
##  [8,] 0.9459608 0.9877318 0.10647601 0.8126083 0.5323880 0.7954397
##  [9,] 0.4020058 0.2459321 0.01370775 0.4643806 0.6189187 0.1977139
## [10,] 0.5130753 0.4373614 0.44808185 0.4017407 0.2130721 0.6466409
## [11,] 0.3633749 0.1980316 0.00859410 0.4204226 0.5642642 0.1623194
## [12,] 0.7343552 0.6542022 0.12602870 0.8208330 0.9926998 0.5384207
## [13,] 0.6902781 0.5636728 0.05306805 0.7884814 0.9854269 0.4443476
## [14,]        NA 0.9529705 0.18184461 0.8898486 0.6636165 0.7843396
## [15,]        NA        NA 0.08102365 0.8132281 0.5146279 0.7712518
## [16,]        NA        NA         NA 0.1168114 0.0356328 0.1977529
## [17,]        NA        NA         NA        NA 0.7634905 0.6544567
## [18,]        NA        NA         NA        NA        NA 0.4016445
## [19,]        NA        NA         NA        NA        NA        NA
## [20,]        NA        NA         NA        NA        NA        NA
##            [,20]
##  [1,] 0.70945264
##  [2,] 0.38574756
##  [3,] 0.50023110
##  [4,] 0.89086820
##  [5,] 0.74802449
##  [6,] 0.17575577
##  [7,] 0.19598985
##  [8,] 0.69414400
##  [9,] 0.49553230
## [10,] 0.30357144
## [11,] 0.44425659
## [12,] 0.88460128
## [13,] 0.86136732
## [14,] 0.79942619
## [15,] 0.68635639
## [16,] 0.06386938
## [17,] 0.91102597
## [18,] 0.83686948
## [19,] 0.53831014
## [20,]         NA

How many are false possitives?

sum(p_vals < 0.05, na.rm = TRUE)

## [1] 7

Introduction to Linear Models

Statistical Models

From Logan, p. 151:

A statistical model is an expression that attempts to explain patterns in the observed values of a response variable by relating the response variable to a set of predictor variables and parameters.

A generic model looks like:

response variable = model + error

Simple linear model

\[ y = bx + a \]

where \(b\) is the slope and \(a\) is the y-intercept.

Written Notes - pages 1-3

Co-variation

Calcualte covariation between sepal length and sepal width, using the iris dataset.

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

library(ggplot2)
ggplot(data = iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  theme_bw()

covar <- sum((iris$Sepal.Length - mean(iris$Sepal.Length)) * (iris$Sepal.Width - mean(iris$Sepal.Width)))/ (nrow(iris) - 1)

## Confirm
var(x = iris$Sepal.Length, y = iris$Sepal.Width)

## [1] -0.042434

Written Notes - Page 4

Correlation

Calculate the correlation between sepal length and sepal width, and calculate whether \(r\) is significantly different from 0 or not.

cor_iris <- covar / (sd(iris$Sepal.Length)*sd(iris$Sepal.Width))

se_cor_iris <- sqrt((1 - cor_iris^2)/(nrow(iris) - 2))

t_cor_iris <- cor_iris/se_cor_iris

## Calculate the probability of getting our t value, or one more extreme
pt(q = t_cor_iris, df = (nrow(iris)-2))*2

## [1] 0.1518983

## Use `cor` to find correlation value
with(data = iris, cor(x = Sepal.Length, y = Sepal.Width, method = "pearson"))

## [1] -0.1175698

## Test the correlation
with(data = iris, cor.test(x = Sepal.Length, y = Sepal.Width, method = "pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

Robust correlation

For Pearson’s correlation coefficient to be appropriate, the data should have:

A linear relationship
A bivariate normal distribution

Spearman’s Rank Correlation - calculates correlation on ranks of observed values
Kendall’s Rank Correlation

Meeting 10 - Correlation and Regression

Matthew E. Aiello-Lammens