In the limit as \(n \to \infty\), the probability distribution of means of samples from any distribution will approach a normal distribution.
We use sample statistics to estimate population statistics. In most cases in biology, populations are too large to measure population parameters directly. Therefore, we use different estimators to calculate the populations statistics based on the sample statistics.
In this lesson, we’re going to explore variation in point estimates and Confidence Intervals. We will largely be following content that can be found in the textbook OpenIntro Statistics by Diez, Barr, and Çetinkaya-Rundel.
We’ll be working with the Youth Risk Behavior Surveillance System (YRBSS). Here, we will start be examining the full data set.
Let’s begin by loading the data into our environment.
You may need to first install the openintro
R
package
install.packages("openintro")
After that, we can load in the data.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data(yrbss)
As with any data set, let’s have a quick look at the data using
summary
.
summary(yrbss)
## age gender grade hispanic
## Min. :12.00 Length:13583 Length:13583 Length:13583
## 1st Qu.:15.00 Class :character Class :character Class :character
## Median :16.00 Mode :character Mode :character Mode :character
## Mean :16.16
## 3rd Qu.:17.00
## Max. :18.00
## NA's :77
## race height weight helmet_12m
## Length:13583 Min. :1.270 Min. : 29.94 Length:13583
## Class :character 1st Qu.:1.600 1st Qu.: 56.25 Class :character
## Mode :character Median :1.680 Median : 64.41 Mode :character
## Mean :1.691 Mean : 67.91
## 3rd Qu.:1.780 3rd Qu.: 76.20
## Max. :2.110 Max. :180.99
## NA's :1004 NA's :1004
## text_while_driving_30d physically_active_7d hours_tv_per_school_day
## Length:13583 Min. :0.000 Length:13583
## Class :character 1st Qu.:2.000 Class :character
## Mode :character Median :4.000 Mode :character
## Mean :3.903
## 3rd Qu.:7.000
## Max. :7.000
## NA's :273
## strength_training_7d school_night_hours_sleep
## Min. :0.00 Length:13583
## 1st Qu.:0.00 Class :character
## Median :3.00 Mode :character
## Mean :2.95
## 3rd Qu.:5.00
## Max. :7.00
## NA's :1176
For now, let’s focus on the variable height
. First,
let’s calculate the population mean for
height
. Remember, that the population mean is going to be
the mean value for the full data set.
mean(yrbss$height)
## [1] NA
That’s probably not the answer you were expecting. There must have
been students whose data was collected, but their heights weren’t
recorded, so a value of NA was assigned. In order to ignore
these data, we need to add the argument na.rm = TRUE
to our
mean
call.
mean(yrbss$height, na.rm = TRUE)
## [1] 1.691241
OK, let’s assume that we could only collect data from a subset of the high school students whose data are in this data set. We can use the data in this subset to estimate population level parameters. This subset would be called our sample.
We’ll choose a subset at random, but we’ll make it so that everyone in the class gets the same random subset.
set.seed(1981)
yrbss_samp <- yrbss[sample(x = 1:nrow(yrbss), replace = FALSE, size = 10), ]
To do this, you should make sure to look at the help for
sample
.
Let’s calculate the mean height based on our sample.
mean(yrbss_samp$height, na.rm = TRUE)
## [1] 1.696667
How does it compare to the true mean?
Let’s try increasing our sample size, and see if that makes a difference?
set.seed(1981)
yrbss_samp <- yrbss[sample(x = 1:nrow(yrbss), replace = FALSE, size = 100), ]
mean(yrbss_samp$height, na.rm = TRUE)
## [1] 1.684362
set.seed(1981)
yrbss_samp <- yrbss[sample(x = 1:nrow(yrbss), replace = FALSE, size = 1000), ]
mean(yrbss_samp$height, na.rm = TRUE)
## [1] 1.682717
In general, though not always, as you increase sample size, you get closer to the true population values for an estimated parameter.
Let’s imagine the we took 100 different subsets, or samples, from these data, then calculated the mean for each of these samples, and then examined the histogram of these means. The shape of the histogram should prove to be that of a normal distribution.
In order to demonstrate this, we’ll need to use for
loops.
# start with a vector to store our mean values
samp_means <- c()
# start a for loop the does 100 iterations
for(x in 1:100){
# get a sample of the data of size 100
yrbss_samp <- yrbss[sample(x = 1:nrow(yrbss), replace = FALSE, size = 100), ]
# calulate the mean of this sample
mean_samp <- mean(yrbss_samp$height, na.rm = TRUE)
# add this mean to the vector of sample means
samp_means <- c(samp_means, mean_samp)
}
We should now have 100 values in the vector samp_means
.
Let’s look at a histogram of these values.
hist(samp_means, breaks = 20)
Alright, it’s normalish (i.e., approximately normal), though clearly not perfectly normal.
mean(yrbss$height, na.rm = TRUE)
mean(samp_means)
sd(yrbss$height, na.rm = TRUE)
sd(samp_means)
The means are very similar, but the sds are not. Why?
The standard error of the mean is the standard deviation of the estimates.
In practice, we only examine one, or a very few, samples. We can still compute the standard error of the mean using the formula:
\[ SE = \frac{\sigma}{\sqrt{n}} \]
, where \(\sigma\) is the population standard deviation and \(n\) is the sample size.
However we almost never know \(\sigma\), so we estimate it with our sample standard deviation value.
We just demonstrated that sample means are approximately normally distributed. This is a consequence of the Central Limit Theorem.
What can we do with this? To start, we can say something about the precision and accuracy of our estimates of the population mean based on the sample mean.
First, we need to learn about Z-scores, which are related to the Standard Normal Distribution. The standard normal distribution is a normal distribution with a mean = 0 and a standard deviation = 1. We can convert any observation from normal distribution to it’s equivalent value from a standard normal distribution using:
\[ z_i = \frac{y_i - \mu}{\sigma} \]
These are sometimes called z-scores.
Using this formula, we can convert a sample mean, \(\bar{y}\), into its corresponding value from a standard normal using:
\[ z = \frac{\bar{y} - \mu}{SE} \]
where the denominator is the standard error of the mean (see above).
We can use this information to estimate how confident we are that are sample mean is a good representation of the true population mean. Again, we are doing this by taking advantage of the fact that we know the sample means are normally distributed. We calculate the range in which 95% of the sample mean values fall.
In mathematical terms, this looks like:
\[ P\{\bar{y} - 1.96 \times SE \leq \mu \leq \bar{y} + 1.96 \times SE \} = 0.95 \]
VERY IMPORTANT: The probability statement here is about the interval, not \(\mu\). \(\mu\) is not a random variable; it is fixed. What we are saying is that there is 95% confidence that this interval includes the population mean, \(\mu\).
As noted above, we rarely know \(\sigma\), so we cannot calculate \(SE\) exactly. So what can we do?
Instead, we use the sample standard deviation, \(s_{\bar{y}}\). So now we are dealing with a random variable that looks like:
\[ \frac{\bar{y}-\mu}{SE} \]
which is no longer standard normal distributed, but rather t distributed. We’ll learn more about this distribution later!
Let’s calculate the Confidence Interval for our estimate of mean height, using our very first sample of the yrbss data.
First, get the sample again.
set.seed(1981)
yrbss_samp <- yrbss[sample(x = 1:nrow(yrbss), replace = FALSE, size = 10), ]
Calculate our sample means.
mean_yrbss_samp <- mean(yrbss_samp$height, na.rm = TRUE)
Now, let’s calculate Standard Error.
se_yrbss_samp <- sd(yrbss_samp$height, na.rm = TRUE) / sqrt(nrow(yrbss_samp))
Now, calculate our 95% Confidence Interval.
CI_yrbss_upper <- mean_yrbss_samp + (1.96*se_yrbss_samp)
CI_yrbss_lower <- mean_yrbss_samp - (1.96*se_yrbss_samp)
Does our Confidence Interval contain the true population mean?
What happens to the Confidence Interval if you increase the sample size to 100?