Here’s a link to a lecture I use in my undergrad class: Chi-squared Primer
\[ \chi^2 = \sum \frac{(o-e)^2}{e} \]
where \(o =\) observed and \(e =\) expected frequencies.
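As a quick illustration of the formula, here is a minimal sketch using made-up numbers: 60 hypothetical rolls of a six-sided die, with a fair die as the expected distribution. The counts and variable names are assumptions for illustration only, not data from this lesson.

```r
# Hypothetical counts for faces 1-6 from 60 rolls (made-up data)
observed <- c(8, 12, 9, 11, 6, 14)

# A fair die predicts equal counts for every face
expected <- rep(sum(observed) / length(observed), length(observed))

# Apply the formula: sum of (o - e)^2 / e over all categories
chi_sq_example <- sum((observed - expected)^2 / expected)
chi_sq_example
```

Every expected count here is 10, so the statistic works out to 42 / 10 = 4.2.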
To examine whether a set of data is distributed in a specific way, that is, whether its observed frequencies match those expected under a known distribution, we can use a \(\chi^2\) test.
First, let’s gather the information on COVID deaths from the CDC website: https://covid.cdc.gov/covid-data-tracker/#demographics.
Second, open the data set in Excel (or the equivalent).
Now, read in the data sheet.
covid = read.csv("../data/deaths_by_race_ethnicity_2025.csv")
Calculate \(\frac{(observed - expected)^2}{expected}\).
covid$obs_exp = ((covid$Death_Percent - covid$Population_Percent)^2) / covid$Population_Percent
Finally, calculate \(\chi^2\) and its probability.
chi_sq = sum(covid$obs_exp)
pchisq(q = chi_sq, df = 5, lower.tail = FALSE)
## [1] 0.3005773
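The `df = 5` used here presumably reflects six categories minus one. To put the p-value in context, the sketch below looks up the critical value for that many degrees of freedom and confirms that the upper-tail probability is simply one minus the lower tail. The statistic `chi_sq_demo` is a made-up value, not the one computed from the CDC data.

```r
k <- 6                 # number of categories (assumed, since df = 5 = k - 1)
df_demo <- k - 1

# Critical value: statistics larger than this have p < 0.05
crit_value <- qchisq(0.95, df = df_demo)
crit_value

chi_sq_demo <- 6.0     # hypothetical test statistic
p_upper <- pchisq(chi_sq_demo, df = df_demo, lower.tail = FALSE)
p_check <- 1 - pchisq(chi_sq_demo, df = df_demo)
c(p_upper, p_check)
```

With 5 degrees of freedom the critical value is about 11.07, so a statistic near 6 does not come close to rejecting the null hypothesis at the 0.05 level.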
You can also visualize the chi-squared distribution with this Shiny app: https://istats.shinyapps.io/ChisqDist/
covid$Population_Percent_Mod <- covid$Population_Percent/sum(covid$Population_Percent)
chisq.test(covid$Death_Percent, p = covid$Population_Percent_Mod)
## Warning in chisq.test(covid$Death_Percent, p = covid$Population_Percent_Mod):
## Chi-squared approximation may be incorrect
##
## Chi-squared test for given probabilities
##
## data: covid$Death_Percent
## X-squared = 5.6561, df = 5, p-value = 0.3411
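One thing to keep in mind: `chisq.test()` treats its first argument as counts, so the statistic depends on the scale of the numbers passed in. The warning above is R flagging small expected values, which can easily happen when percentages stand in for counts. The sketch below uses made-up counts and proportions (not the CDC data) to show how the statistic changes when the same data are expressed as percentages.

```r
# Hypothetical counts in six groups and hypothetical expected proportions
counts <- c(120, 80, 50, 30, 15, 5)
props  <- c(0.35, 0.25, 0.20, 0.10, 0.07, 0.03)
n <- sum(counts)

# Test on the raw counts
stat_counts <- unname(chisq.test(counts, p = props)$statistic)

# Same data expressed as percentages (sums to 100 instead of n)
pct <- 100 * counts / n
stat_pct <- unname(suppressWarnings(chisq.test(pct, p = props))$statistic)

# The statistic on percentages is smaller by the factor 100 / n
c(counts = stat_counts, percent = stat_pct, ratio = stat_counts / stat_pct)
```

With n = 300 the ratio is 3, so the percentage version understates the statistic by that factor. If raw death counts are available, running the test on those may be the safer choice.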
Next, we examine whether two or more categorical variables are associated with each other. This is essentially an extension of the \(\chi^2\) analysis. The null hypothesis is that the variables are not associated.
From Logan pp. 478 - 480
In order to investigate the mortality of coolibah (Eucalyptus coolibah) trees across riparian dunes, Roberts (1993) counted the number of quadrats in which dead trees were present and the number in which they were absent in three positions (top, middle and bottom) along transects from the lakeshore up to the top of dunes. In this case, the classification of quadrats according to the presence/absence of dead coolibah trees will be interpreted as a response variable and the position along transect as a predictor variable (see Box 14.3 of Quinn and Keough (2002)).
(If there is time, go through a few slides on this.)
Get the data
trees = read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter16/Data/roberts.csv")
head(trees)
## QUADRAT POSITION DEAD
## 1 1 Bottom With
## 2 2 Bottom With
## 3 3 Bottom Without
## 4 4 Bottom Without
## 5 5 Top Without
## 6 6 Bottom With
Make a contingency table
trees_xtab = table(trees$POSITION, trees$DEAD)
trees_xtab
##
## With Without
## Bottom 15 13
## Middle 4 8
## Top 0 17
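Under the null hypothesis of no association, the expected count in each cell is (row total × column total) / grand total. As a sketch, the code below rebuilds the table printed above by hand (so it runs without the CSV) and computes the expected counts and the \(\chi^2\) statistic from the formula at the top of this lesson. The object names are just for illustration.

```r
# Rebuild the contingency table from the printed output above
tab <- matrix(c(15, 4, 0, 13, 8, 17), nrow = 3,
              dimnames = list(c("Bottom", "Middle", "Top"),
                              c("With", "Without")))

# Expected counts if POSITION and DEAD were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Chi-squared statistic and degrees of freedom: (rows - 1) * (cols - 1)
chi_sq_manual <- sum((tab - expected)^2 / expected)
df_manual <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_manual <- pchisq(chi_sq_manual, df = df_manual, lower.tail = FALSE)
c(chi_sq = chi_sq_manual, df = df_manual, p = p_manual)
```

The hand calculation should agree with what `chisq.test()` reports for the same table.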
Run a \(\chi^2\) test
trees_chisq = chisq.test(trees_xtab)
## Warning in chisq.test(trees_xtab): Chi-squared approximation may be incorrect
trees_chisq
##
## Pearson's Chi-squared test
##
## data: trees_xtab
## X-squared = 13.661, df = 2, p-value = 0.00108
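The warning appears because at least one expected count is below 5. A minimal sketch for following up: the object returned by `chisq.test()` stores the expected counts and standardized residuals, and Fisher's exact test (`fisher.test()`) is one common alternative when expected counts are small. The table is rebuilt by hand here so the block runs on its own; `trees_tab`, `res`, and `fisher_p` are illustrative names.

```r
# Same counts as the contingency table above
trees_tab <- matrix(c(15, 4, 0, 13, 8, 17), nrow = 3,
                    dimnames = list(POSITION = c("Bottom", "Middle", "Top"),
                                    DEAD = c("With", "Without")))

res <- suppressWarnings(chisq.test(trees_tab))

# Expected counts: the Middle/With cell is below 5, which triggers the warning
res$expected

# Standardized residuals: large absolute values show which cells drive the result
res$stdres

# Fisher's exact test does not rely on the chi-squared approximation
fisher_p <- fisher.test(trees_tab)$p.value
fisher_p
```

If the exact test points the same way as the \(\chi^2\) test, the small expected count is probably not changing the conclusion.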