Section Goals

Brief lecture

Here’s a link to a lecture I use in my undergrad class: Chi-squared Primer.

\(\chi^2\) statistic

\[ \chi^2 = \sum \frac{(o-e)^2}{e} \]

where \(o =\) observed and \(e =\) expected frequencies.
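As a quick illustration of the formula, the statistic can be computed by hand in R. This is only a sketch: the counts below are hypothetical values made up for this example, and the variable names (`observed`, `expected`, `chi_sq_example`) are not part of the lecture data.

```r
# Hypothetical observed counts in three categories (illustration only)
observed <- c(18, 22, 20)

# Expected counts if all three categories were equally likely
expected <- rep(sum(observed) / length(observed), length(observed))

# Sum of (o - e)^2 / e across categories
chi_sq_example <- sum((observed - expected)^2 / expected)
chi_sq_example
```

Each category contributes its own squared deviation, scaled by the expected count, so categories that match their expectation exactly (the third one here) add nothing to the total.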

Goodness of fit tests

To examine whether a set of data is distributed in a specific way, or has frequencies that match a known distribution, we can use a \(\chi^2\) goodness of fit test.

Example - COVID deaths and race

First, let’s gather the information on COVID deaths from the CDC website: https://covid.cdc.gov/covid-data-tracker/#demographics.

Second, open the data set in Excel (or an equivalent spreadsheet program).

Now, read in the data sheet.

covid = read.csv("../data/deaths_by_race_ethnicity_2025.csv")

Calculate \(\frac{(observed - expected)^2}{expected}\).

covid$obs_exp = ((covid$Death_Percent - covid$Population_Percent)^2) / covid$Population_Percent

Finally, calculate \(\chi^2\) and its probability.

chi_sq = sum(covid$obs_exp)
pchisq(q = chi_sq, df = 5, lower.tail = FALSE)
## [1] 0.3005773
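One way to read this probability is to compare it against a significance threshold. The sketch below assumes a conventional threshold of 0.05 (the lecture does not state one) and uses `qchisq()` to find the matching critical value for 5 degrees of freedom. The names `alpha`, `df_example`, `crit`, and `p_at_crit` are introduced here for illustration.

```r
# Assumed significance level (not specified in the lecture)
alpha <- 0.05
df_example <- 5

# Critical value: chi-squared values above this have an upper-tail
# probability smaller than alpha
crit <- qchisq(1 - alpha, df = df_example)
crit

# Sanity check: the upper-tail probability at the critical value is alpha
p_at_crit <- pchisq(crit, df = df_example, lower.tail = FALSE)
p_at_crit
```

Under that assumed threshold, the probability of about 0.30 computed above is larger than 0.05, so we would not reject the null hypothesis.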

You can also visualize the chi-squared distribution with this Shiny app: https://istats.shinyapps.io/ChisqDist/

R function goodness of fit

covid$Population_Percent_Mod <- covid$Population_Percent/sum(covid$Population_Percent)
chisq.test(covid$Death_Percent, p = covid$Population_Percent_Mod)
## Warning in chisq.test(covid$Death_Percent, p = covid$Population_Percent_Mod):
## Chi-squared approximation may be incorrect
## 
##  Chi-squared test for given probabilities
## 
## data:  covid$Death_Percent
## X-squared = 5.6561, df = 5, p-value = 0.3411
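Note that `chisq.test()` treats its first argument as counts, and it warns when any expected count is below 5, which is the likely source of the warning above when percentages are passed in. As a minimal sketch of the same call on raw counts, the example below uses hypothetical die-roll data (made up for illustration, not from the CDC file). The `rescale.p = TRUE` argument is an alternative to normalizing the probabilities by hand.

```r
# Hypothetical counts from 120 rolls of a six-sided die (illustration only)
observed_rolls <- c(22, 17, 20, 26, 22, 13)

# Null hypothesis: every face is equally likely
fair_probs <- rep(1 / 6, 6)

gof <- chisq.test(x = observed_rolls, p = fair_probs)
gof$statistic  # chi-squared statistic
gof$parameter  # degrees of freedom (number of categories - 1)
gof$expected   # expected counts, all 20 here, so no warning is raised

# Equivalent call with unnormalized weights, rescaled internally to sum to 1
gof_rescaled <- chisq.test(x = observed_rolls, p = rep(1, 6), rescale.p = TRUE)
```

Because the statistic grows with the total number of observations, running the test on counts rather than percentages can change the result substantially.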

Contingency tables

Contingency tables let us examine whether two or more categorical variables are associated with each other. This is essentially an extension of the \(\chi^2\) goodness of fit analysis. The null hypothesis is that the variables are not associated (that is, they are independent).
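For a two-way table, the expected frequency in each cell under the null hypothesis comes from the row and column totals. The notation here (\(R_i\) for the total of row \(i\), \(C_j\) for the total of column \(j\), \(N\) for the grand total, and \(r\) and \(c\) for the number of rows and columns) is introduced for this note and is not from the original lecture.

\[ e_{ij} = \frac{R_i \times C_j}{N}, \qquad df = (r - 1)(c - 1) \]

These expected frequencies are then used in the same \(\chi^2\) formula as before, summing over all cells of the table.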

Example - Two-way contingency table

From Logan, pp. 478-480

In order to investigate the mortality of coolibah (Eucalyptus coolibah) trees across riparian dunes, Roberts (1993) counted the number of quadrats in which dead trees were present and the number in which they were absent in three positions (top, middle and bottom) along transects from the lakeshore up to the top of dunes. In this case, the classification of quadrats according to the presence/absence of dead coolibah trees will be interpreted as a response variable and the position along transect as a predictor variable (see Box 14.3 of Quinn and Keough (2002)).

(If there is time, go through a few slides on this.)

Get the data

trees = read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter16/Data/roberts.csv")

head(trees)
##   QUADRAT POSITION    DEAD
## 1       1   Bottom    With
## 2       2   Bottom    With
## 3       3   Bottom Without
## 4       4   Bottom Without
## 5       5      Top Without
## 6       6   Bottom    With

Make a contingency table

trees_xtab = table(trees$POSITION, trees$DEAD)
trees_xtab
##         
##          With Without
##   Bottom   15      13
##   Middle    4       8
##   Top       0      17

Run a \(\chi^2\) test

trees_chisq = chisq.test(trees_xtab)
## Warning in chisq.test(trees_xtab): Chi-squared approximation may be incorrect
trees_chisq
## 
##  Pearson's Chi-squared test
## 
## data:  trees_xtab
## X-squared = 13.661, df = 2, p-value = 0.00108
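To see where this result (and the warning) comes from, the test can be reproduced by hand. The sketch below re-enters the counts from the contingency table printed above, so it runs without downloading the data. The object names (`trees_counts`, `expected_counts`, `chi_sq_manual`, `df_manual`, `p_manual`) are introduced here for illustration.

```r
# Counts copied from the contingency table above (filled column by column)
trees_counts <- matrix(
  c(15, 4, 0,    # With
    13, 8, 17),  # Without
  nrow = 3,
  dimnames = list(POSITION = c("Bottom", "Middle", "Top"),
                  DEAD = c("With", "Without"))
)

# Expected counts under independence: (row total * column total) / grand total
expected_counts <- outer(rowSums(trees_counts), colSums(trees_counts)) /
  sum(trees_counts)
expected_counts

# Chi-squared statistic, degrees of freedom, and upper-tail probability
chi_sq_manual <- sum((trees_counts - expected_counts)^2 / expected_counts)
df_manual <- (nrow(trees_counts) - 1) * (ncol(trees_counts) - 1)
p_manual <- pchisq(chi_sq_manual, df = df_manual, lower.tail = FALSE)
c(chi_sq = chi_sq_manual, df = df_manual, p = p_manual)
```

The expected count for the Middle/With cell is 4, which is below 5, and that is most likely what triggers the "approximation may be incorrect" warning. In that situation, `fisher.test(trees_xtab)` or `chisq.test(trees_xtab, simulate.p.value = TRUE)` are possible alternatives to consider.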