Refresh

What is a statistical distribution?

What are some of the distributions we talked about in the past?

What characteristics of distributions are we often interested in?

Other important distributions in data analysis

Standard Normal distribution

Recall that in Lecture 6 we were introduced to the Normal (or Gaussian) Distribution with pdf:

\[ f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

The Standard Normal distribution is a Normal distribution with \(\mu = 0\) and \(\sigma = 1\).
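As a quick sanity check (not part of the original notes), R's dnorm() evaluates this pdf directly, so at \(x = 0\) with \(\mu = 0\) and \(\sigma = 1\) it should return \(1/\sqrt{2\pi}\):

dnorm(x = 0, mean = 0, sd = 1)  # evaluates the Normal pdf at 0
1 / sqrt(2 * pi)                # same value, approximately 0.3989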

Challenge

Draw 1000 samples from the standard normal distribution and plot a histogram and density curve for the resulting data. After making this plot, try 10,000 samples.

stdnorm_samps <- data.frame( samps = rnorm(n = 10000, mean = 0, sd = 1) )

library(ggplot2)

ggplot(data = stdnorm_samps, aes(x = samps)) +
  geom_histogram( aes( y = ..density.. ) ) +
  geom_density(colour = "red", size = 2) + 
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • How well does the density plot represent the standard normal distribution?
  • Can we add the functional form of the normal distribution?
ggplot(data = stdnorm_samps, aes(x = samps)) +
  geom_histogram( aes( y = ..density.. ) ) +
  geom_density(colour = "red", size = 2) + 
  stat_function( fun = dnorm, colour = "blue", alpha = 0.6, size = 2) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can convert from a normal distribution with any mean \(\mu\) and standard deviation \(\sigma\) to the standard normal by subtracting the mean from each value, then dividing by the standard deviation:

\[ z_i = \frac{y_i - \mu}{\sigma} \]
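In R this conversion can be done by hand or with scale(). A minimal sketch, using a hypothetical sample drawn with \(\mu = 10\) and \(\sigma = 2\) (in practice we plug in the sample mean and standard deviation):

y <- rnorm(n = 1000, mean = 10, sd = 2)   # hypothetical non-standard normal sample
z <- (y - mean(y)) / sd(y)                # standardize by hand
z_alt <- scale(y)[, 1]                    # scale() performs the same standardization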

The utility of knowing the amount of area under the curve for probability densities

  • Where do 95% of the values from the standard normal fall? Introducing qnorm.
qnorm(.025)
## [1] -1.959964
qnorm(.975)
## [1] 1.959964
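pnorm() is the complementary function: it returns the area under the curve to the left of a value. A quick check (not in the original code) that roughly 95% of the density lies between these two quantiles:

pnorm(1.959964) - pnorm(-1.959964)  # area between the quantiles, approximately 0.95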
  • How might we visualize this?
ggplot(data = stdnorm_samps, aes(x = samps)) +
  coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.41)) +
  geom_vline(xintercept = c(-1.96, 1.96), size = 2 ) +
  stat_function( fun = dnorm, colour = "blue", size = 2) +
  theme_bw()

The Central Limit Theorem

This section gets into more nitty-gritty probability and statistics. You should review this information in both of the course textbooks.

Sampling distribution of the mean

The frequency distribution of the mean values of several samples taken from a population.

From Q&K p. 18:

The probability distribution of means of samples from a normal distribution is also normally distributed.

Challenge

  1. Take 100 samples of 10 observations from a normal distribution with a mean of 0 and a standard deviation of 1. Plot a histogram of the means (one possible approach is sketched after this list).
  2. Increase the number of samples. How does this affect the outcome?
  3. Increase the number of observations. How does this affect the outcome?
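One possible approach to the first step, as a sketch (the draws are random, so your histogram will differ from run to run):

# means of 100 samples, each of 10 observations from N(0, 1)
sample_means <- replicate(100, mean(rnorm(n = 10, mean = 0, sd = 1)))

ggplot(data = NULL) +
  geom_histogram(aes(x = sample_means)) +
  theme_bw()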

CLT

In the limit as \(n \to \infty\), the probability distribution of means of samples from any distribution will approach a normal distribution.

Problem Set 6 - review

Plot a histogram of 100 observations that are Poisson distributed with \(\lambda = 1\). Describe the shape of the distribution. Does it look normal?

ggplot(data = NULL) +
  geom_histogram(aes(x = rpois(100, lambda = 1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, take 100 samples of 10 observations from a Poisson distribution and calculate the mean of each sample. Plot a histogram of the means.
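One possible sketch (the sampling is random, so results will vary):

# means of 100 samples, each of 10 Poisson(lambda = 1) observations
pois_means <- replicate(100, mean(rpois(n = 10, lambda = 1)))

ggplot(data = NULL) +
  geom_histogram(aes(x = pois_means))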

Using the CLT

From Q&K p. 18:

The expected value or mean of the probability distribution of sample means equals the mean of the population (\(\mu\)) from which the samples were taken.

Standard error of the sample mean

We just demonstrated that sample means are normally distributed. What can we do with this? To start, we can say something about the precision and accuracy of our estimates of the population mean based on the sample mean.

Standard error of the mean - the standard deviation of the sampling distribution of the sample mean:

\[ \sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}} \]

Often we have only one sample from the population. What can we say then?

Standard error of the mean:

\[ s_{\bar{y}} = \frac{s}{\sqrt{n}} \]

where \(s\) is the sample estimate of the standard deviation of the original population and \(n\) is the sample size.

NB: The standard error of the mean “tells us about the error in using \(\bar{y}\) to estimate \(\mu\).”
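With a single sample in hand, the standard error is a one-line calculation in R. A minimal sketch, using a hypothetical sample y:

y <- rnorm(n = 25, mean = 0, sd = 1)  # one hypothetical sample of size 25
sd(y) / sqrt(length(y))               # s / sqrt(n), the standard error of the mean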

Confidence intervals for the population mean

Recall that we can convert any observation from a normal distribution to its equivalent value from a standard normal distribution using:

\[ z_i = \frac{y_i - \mu}{\sigma} \]

These are sometimes called z-scores.

Using this formula, we can convert a sample mean, \(\bar{y}\), into its corresponding value from a standard normal using:

\[ z = \frac{\bar{y} - \mu}{\sigma_{\bar{y}}} \]

where the denominator is the standard error of the mean (see above).

We can use this information to estimate how confident we are that our sample mean is a good representation of the true population mean. Again, we are doing this by taking advantage of the fact that we know the sample means are normally distributed. We calculate the range in which 95% of the sample mean values fall.

In mathematical terms, this looks like:

\[ P\{\bar{y} - 1.96\sigma_{\bar{y}} \leq \mu \leq \bar{y} + 1.96\sigma_{\bar{y}} \} = 0.95 \]

VERY IMPORTANT: The probability statement here is about the interval, not \(\mu\). \(\mu\) is not a random variable; it is fixed. What we are saying is that there is 95% confidence that this interval includes the population mean, \(\mu\).
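Putting the pieces together in R, here is a minimal sketch for a hypothetical sample y, with qnorm() supplying the \(\pm 1.96\). Note that it plugs the sample standard deviation in for \(\sigma\), which is exactly the issue taken up next:

y <- rnorm(n = 25, mean = 0, sd = 1)      # hypothetical sample
se_y <- sd(y) / sqrt(length(y))           # standard error of the mean
mean(y) + qnorm(c(0.025, 0.975)) * se_y   # lower and upper 95% limits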

Unknown \(\sigma\) (and \(\sigma_{\bar{y}}\))

We rarely know \(\sigma\), so we cannot calculate \(\sigma_{\bar{y}}\). So what can we do?

Instead, we use the sample-based standard error of the mean, \(s_{\bar{y}}\). So now we are dealing with a random variable that looks like:

\[ \frac{\bar{y}-\mu}{s_{\bar{y}}} \]

which no longer follows a standard normal distribution, but rather a t distribution.

Challenge

Use whatever resources are available to you and look up the t distribution. What is/are the parameter(s) of the t distribution?

Degrees of freedom

The number of observations in our sample that are “free to vary” when we are estimating the variance (Q&K p.20; Box 2.1).

If I have calculated the mean, how many observations, out of a total of \(n\), are free to vary?

Answer: \(n-1\). Once I have the mean, I can use it and any \(n-1\) values to calculate the \(n\)th value.

General rule regarding df’s:

the df is the number of observations minus the number of parameters included in the formula for the variance (Q&K p.20; Box 2.1).
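Tying this back to the confidence interval above: when \(\sigma\) is unknown, we swap the \(\pm 1.96\) from qnorm() for quantiles from qt() with \(n - 1\) degrees of freedom. A sketch with a hypothetical small sample:

y <- rnorm(n = 10, mean = 0, sd = 1)                       # hypothetical small sample
se_y <- sd(y) / sqrt(length(y))                            # sample standard error of the mean
mean(y) + qt(c(0.025, 0.975), df = length(y) - 1) * se_y   # t-based 95% interval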

Standard error of other statistics

The standard error is … the standard deviation of the probability distribution of a specific statistic (Q&K p. 20).

Standard error of the sample variance

The distribution of the sample variance is not normal. It is \(\chi^2\) distributed.

Mathematically, this is explained as:

\[ X \thicksim N(0,1) \implies X^2 \thicksim \chi^2_1 \]

Intuitively, why does this make sense?
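One way to see this empirically (a sketch, not in the original notes): square a large number of standard normal draws and compare an upper quantile with the corresponding \(\chi^2_1\) quantile, which is about 3.84 (i.e. \(1.96^2\)):

quantile(rnorm(100000)^2, probs = 0.95)  # empirical 95th percentile of squared draws
qchisq(0.95, df = 1)                     # theoretical chi-squared quantile, about 3.84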