What is a statistical distribution?
What are some of the distributions we talked about in the past?
What characteristics of distributions are we often interested in?
Recall that in Lecture 6 we were introduced to the Normal (or Gaussian) Distribution with pdf:
\[ f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
The Standard Normal distribution is a Normal distribution with \(\mu = 0\) and \(\sigma = 1\).
Draw 1000 samples from the standard normal distribution and plot a histogram and density curve for the resulting data. After making this plot, try 10,000 samples.
stdnorm_samps <- data.frame( samps = rnorm(n = 10000, mean = 0, sd = 1) )
library(ggplot2)
ggplot(data = stdnorm_samps, aes(x = samps)) +
geom_histogram( aes( y = after_stat(density) ) ) +
geom_density(colour = "red", linewidth = 2) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = stdnorm_samps, aes(x = samps)) +
geom_histogram( aes( y = after_stat(density) ) ) +
geom_density(colour = "red", linewidth = 2) +
stat_function( fun = dnorm, colour = "blue", alpha = 0.6, linewidth = 2) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can convert from a normal distribution with any mean \(\mu\) and standard deviation \(\sigma\) to the standard normal by subtracting the mean from each value, then dividing by the standard deviation:
\[ z_i = \frac{y_i - \mu}{\sigma} \]
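As a quick sketch of this conversion (the mean of 10 and standard deviation of 2 below are arbitrary choices for illustration):

```r
# Standardize draws from N(mu = 10, sigma = 2) back to standard normal
set.seed(42)                 # for reproducibility
y <- rnorm(1000, mean = 10, sd = 2)
z <- (y - 10) / 2            # z_i = (y_i - mu) / sigma
c(mean(z), sd(z))            # approximately 0 and 1
```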
qnorm
qnorm(.025)
## [1] -1.959964
qnorm(.975)
## [1] 1.959964
ggplot(data = stdnorm_samps, aes(x = samps)) +
coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.41)) +
geom_vline(xintercept = c(-1.96, 1.96), linewidth = 2 ) +
stat_function( fun = dnorm, colour = "blue", linewidth = 2) +
theme_bw()
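We can check that the two `qnorm()` cutoffs really do enclose 95% of the distribution by going in the other direction with `pnorm()`:

```r
# Central area between -1.96 and 1.96 under the standard normal curve
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042
```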
This section gets into more nitty-gritty probability and statistics. You should review this information in both of the course textbooks.
The frequency distribution of the mean values of several samples taken from a population.
From Q&K p. 18:
The probability distribution of means of samples from a normal distribution is also normally distributed.
In the limit as \(n \to \infty\), the probability distribution of means of samples from any distribution will approach a normal distribution.
Plot a histogram of 100 observations that are Poisson distributed with a \(\lambda\) value \(= 1\). Describe the shape of the distribution. Does it look normal?
ggplot(data = NULL) +
geom_histogram(aes(x = rpois(100, lambda = 1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, take 100 samples of 10 observations from a Poisson distribution and calculate the mean for each sample. Plot a histogram of the means.
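One possible approach, using `replicate()` to draw the repeated samples (the `bins` setting is an arbitrary choice):

```r
library(ggplot2)

# Draw 100 samples of n = 10 from Poisson(lambda = 1),
# recording the mean of each sample
set.seed(1)
samp_means <- replicate(100, mean(rpois(10, lambda = 1)))

ggplot(data = NULL) +
  geom_histogram(aes(x = samp_means), bins = 15) +
  theme_bw()
```

The histogram of means should look much more symmetric and bell-shaped than the raw Poisson observations.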
From Q&K p. 18:
The expected value or mean of the probability distribution of sample means equals the mean of the population (\(\mu\)) from which the samples were taken.
We just demonstrated that sample means are normally distributed. What can we do with this? To start, we can say something about the precision and accuracy of our estimates of the population mean based on the sample mean.
Standard error of the mean - the standard deviation of the probability distribution of the sample mean:
\[ \sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}} \]
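We can check this formula by simulation (sample size and number of replicates below are arbitrary choices):

```r
# The sd of many sample means from N(0, 1) with n = 25
# should be close to sigma / sqrt(n) = 1 / sqrt(25) = 0.2
set.seed(2)
means <- replicate(5000, mean(rnorm(25)))
sd(means)        # close to 0.2
1 / sqrt(25)
```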
Often times we only have one sample from the population. What can we say then?
Standard error of the mean:
\[ s_{\bar{y}} = \frac{s}{\sqrt{n}} \]
where \(s\) is the sample estimate of the standard deviation of the original population and \(n\) is the sample size.
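For a single sample, this estimate is just a one-liner (the parameter values below are arbitrary):

```r
# Estimated standard error of the mean from one sample: s / sqrt(n)
set.seed(3)
y <- rnorm(30, mean = 5, sd = 2)
sd(y) / sqrt(length(y))    # estimates sigma / sqrt(n) = 2 / sqrt(30)
```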
NB: The standard error of the means “tells us about the error in using \(\bar{y}\) to estimate \(\mu\).”
Recall that we can convert any observation from a normal distribution to its equivalent value from a standard normal distribution using:
\[ z_i = \frac{y_i - \mu}{\sigma} \]
These are sometimes called z-scores.
Using this formula, we can convert a sample mean, \(\bar{y}\), into its corresponding value from a standard normal using:
\[ z = \frac{\bar{y} - \mu}{\sigma_{\bar{y}}} \]
where the denominator is the standard error of the mean (see above).
We can use this information to estimate how confident we are that our sample mean is a good representation of the true population mean. Again, we are doing this by taking advantage of the fact that we know the sample means are normally distributed. We calculate the range in which 95% of the sample mean values fall.
In mathematical terms, this looks like:
\[ P\{\bar{y} - 1.96\sigma_{\bar{y}} \leq \mu \leq \bar{y} + 1.96\sigma_{\bar{y}} \} = 0.95 \]
VERY IMPORTANT: The probability statement here is about the interval, not \(\mu\). \(\mu\) is not a random variable; it is fixed. What we are saying is that there is 95% confidence that this interval includes the population mean, \(\mu\).
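We can see this interpretation in action by simulation (a sketch, with arbitrary values of \(\mu\), \(\sigma\), and \(n\), and using the known \(\sigma\)):

```r
# Coverage check: ybar +/- 1.96 * sigma / sqrt(n) should capture
# mu in about 95% of repeated samples
set.seed(4)
mu <- 10; sigma <- 3; n <- 20
covers <- replicate(2000, {
  ybar <- mean(rnorm(n, mu, sigma))
  half <- 1.96 * sigma / sqrt(n)
  (ybar - half <= mu) & (mu <= ybar + half)
})
mean(covers)    # close to 0.95
```

Note that in each replicate it is the interval that moves around, while \(\mu\) stays fixed.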
We rarely know \(\sigma\), so we cannot calculate \(\sigma_{\bar{y}}\). So what can we do?
Instead, we use the sample standard error, \(s_{\bar{y}}\). So now we are dealing with a random variable that looks like:
\[ \frac{\bar{y}-\mu}{s_{\bar{y}}} \]
which is no longer standard normal distributed, but rather t distributed.
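One practical consequence, which we can see with `qt()`: the t distribution has heavier tails than the standard normal, so its quantiles are wider, especially for small samples:

```r
# 97.5% quantiles: t is wider than normal, converging as df grows
qt(0.975, df = 4)
## [1] 2.776445
qt(0.975, df = 30)
## [1] 2.042272
qnorm(0.975)
## [1] 1.959964
```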
Use whatever resources are available to you and look up the t distribution. What is/are the parameter(s) of the t distribution?
The number of observations in our sample that are “free to vary” when we are estimating the variance (Q&K p.20; Box 2.1).
If I have calculated the mean, how many observations, out of a total of n, are free to vary?
Answer: \(n-1\). Once I have the mean, I can use it and any \(n-1\) values to calculate the \(n\)th value.
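A tiny numerical illustration of this (the values here are arbitrary):

```r
# Given the mean and n-1 of the values, the last value is fixed
y <- c(2, 4, 6, 8)
m <- mean(y)                   # 5
length(y) * m - sum(y[1:3])    # recovers the 4th value: 8
```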
General rule regarding df’s:
the df is the number of observations minus the number of parameters included in the formula for the variance (Q&K p.20; Box 2.1).
The standard error is … the standard deviation of the probability distribution of a specific statistic (Q&K p. 20).
The distribution of the sample variance is not normal. It is \(\chi^2\) distributed.
Mathematically, this is explained as:
\[ X \sim N(0,1) \implies X^2 \sim \chi^2_1 \]
Intuitively, why does this make sense?
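One way to build intuition is by simulation: square a large number of standard normal draws and compare their histogram to the \(\chi^2_1\) density curve (a sketch; the number of draws and bins are arbitrary):

```r
library(ggplot2)

# Squares of standard normal draws follow a chi-squared
# distribution with 1 degree of freedom
set.seed(5)
x2 <- rnorm(10000)^2

ggplot(data = NULL, aes(x = x2)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  stat_function(fun = dchisq, args = list(df = 1), colour = "blue") +
  coord_cartesian(xlim = c(0, 8)) +
  theme_bw()
```

Squaring removes the sign, so all the mass piles up on the positive axis, concentrated near zero where most standard normal draws fall.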