Probability

We usually talk about probabilities in terms of events; the probability of event A occurring is written P(A). Probabilities can be between zero and one; if P(A) equals zero, then the event is impossible; if P(A) equals one, then the event is certain. - Quinn and Keough 2002

Frequentist statistics

In the frequentist view, the probability of an event is its relative frequency over the long term, after many trials.

For example, tossing a coin: what is the probability of ‘heads’? Consider the estimate after one trial, two trials, …, 1000 trials.

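## One simulated toss: TRUE counts as 'heads' (named is_heads to avoid masking R's head function)
(is_heads <- runif(1) > 0.5)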
## [1] FALSE
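
The idea of a long-run relative frequency can be sketched by simulating many tosses and tracking the proportion of heads so far. This is a minimal sketch, not part of the original exercise: the seed and the names n_trials, tosses, and running_freq are chosen here for illustration only.

```r
set.seed(42)  # assumed seed, used only so the sketch is repeatable
n_trials <- 1000

# TRUE means 'heads'; each toss is an independent fair coin
tosses <- runif(n_trials) > 0.5

# Proportion of heads after 1, 2, ..., n_trials tosses
running_freq <- cumsum(tosses) / seq_len(n_trials)

# Look at the estimate after 1, 2, 10, 100, and 1000 tosses
running_freq[c(1, 2, 10, 100, 1000)]
```

Early values jump around a lot; with more trials the proportion should settle close to 0.5.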

Bayesian statistics

For now, all I will say is that there are other perspectives in statistics. We’ll cover a bit of Bayes at another time.


Probability through M&Ms

Open your bag(s) and count how many M&Ms you got. Also count how many of each color.

Make a new variable called tot_mm and assign it the number of M&Ms in your bag.

tot_mm <- 60

Mean number of M&Ms in a bag

Calculate the mean number of M&Ms in a bag, using the totals from several bags.

tot_mm_pop_sample <- c(60, 58, 57, 63, 61, 60, 59)

Mean:

\[ \bar{x} = \frac{\sum_{i=1}^n{x_i}}{n} \]

Using the sum function.

(mean_mm_1 <- sum(tot_mm_pop_sample) / length(tot_mm_pop_sample))
## [1] 59.71429

Using the mean function.

(mean_mm_2 <- mean(tot_mm_pop_sample))
## [1] 59.71429

Quantify the variation in the number of M&Ms in a bag

  • var function
  • sd function

What do we mean by variation?

In general, we mean something along the lines of ‘how far the individual values are spread around some central value, such as the mean’.

How might we quantify this?

  • Add up the distances from the mean.

First calculate distances from the mean for each bag.

(dist_from_mean <- tot_mm_pop_sample - mean_mm_1)
## [1]  0.2857143 -1.7142857 -2.7142857  3.2857143  1.2857143  0.2857143
## [7] -0.7142857

Next, calculate the total of these distances. We could also call this the sum of distances.

(sum_dist_from_mean <- sum(dist_from_mean))
## [1] -7.105427e-15

What information does this give us, and why might it not be useful?

  • Add up the absolute distances from the mean.

First calculate the absolute distances from the mean.

(absdist_from_mean <- abs(tot_mm_pop_sample - mean_mm_1))
## [1] 0.2857143 1.7142857 2.7142857 3.2857143 1.2857143 0.2857143 0.7142857

Next add up the total absolute distances from the mean.

(sum_absdist_from_mean <- sum(absdist_from_mean))
## [1] 10.28571

  • Add up the squared distances from the mean.

While the absolute distance from the mean does give us a reasonable measure of variability, because of specific mathematical properties, it’s more convenient to work with squared distances from the mean, leading to the measures of variance and standard deviation.

Challenge

Calculate squared distances from the mean, and sum them to determine the total squared deviation.

Standard deviation can be thought of, loosely, as the typical distance of an observation from the mean. More precisely, it is the square root of the variance.
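
For reference, the sample variance and standard deviation are usually written as below, in the same notation as the formula for the mean. The n − 1 denominator is an assumption about which version is intended here; it is the one that R's var and sd functions use.

\[ s^2 = \frac{\sum_{i=1}^n{(x_i - \bar{x})^2}}{n - 1}, \qquad s = \sqrt{s^2} \]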

Challenge

  1. Calculate the variance of the number of M&Ms in a bag, considering all of the bags in the class. Do this without using the var function.

  2. Calculate the standard deviation of the number of M&Ms in a bag. Do this without using the sd function.

Probability and M&Ms

Without looking, you choose one M&M. What colors could you have chosen?

The set of brown, yellow, green, red, orange, blue is the sample space.

Your selection of a single M&M is called an event.

Challenge

  • What’s the probability of drawing an M&M of each color from your bag?

P(brown), P(yellow), etc.

Putting our data together into a data.frame

Here we will collate the class’s data into a single data set.

Challenge

What is the mean probability of getting each color, averaged across all bags?

Tools to use:

  • data.frames
  • apply
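
One possible way to combine these tools is sketched below. The counts are made-up numbers for three hypothetical bags, not the class data, and the names mm_counts, mm_props, and mean_props are chosen here for illustration.

```r
# Hypothetical counts: one row per bag, one column per color
mm_counts <- data.frame(
  blue   = c(12, 10, 14),
  brown  = c(8, 9, 7),
  green  = c(10, 11, 9),
  orange = c(11, 12, 10),
  red    = c(9, 8, 10),
  yellow = c(10, 10, 10)
)

# Turn each bag (row) into proportions; apply() over rows returns the
# result with bags as columns, so transpose to get one row per bag again
mm_props <- t(apply(mm_counts, 1, function(bag) bag / sum(bag)))

# Mean proportion of each color across bags (one value per column)
mean_props <- apply(mm_props, 2, mean)
mean_props
```

Because each bag's proportions sum to one, the mean proportions across bags also sum to one.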

Probability of events

  • What’s the P(green OR blue OR red) in your bag?

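P(green) + P(blue) + P(red). The probabilities add because a single M&M can be only one color, so these events are mutually exclusive.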

  • What is the P(NOT green) in your bag?

\[ P(\sim Green) = P(Green^c) = 1 - P(Green) \]
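
As a rough illustration of the addition and complement rules, here is a sketch using made-up counts for a single bag. The vector bag_counts and its values are hypothetical; substitute the counts from your own bag.

```r
# Hypothetical color counts for one bag (replace with your own)
bag_counts <- c(blue = 12, brown = 8, green = 10,
                orange = 11, red = 9, yellow = 10)

# Probability of drawing each color on a single draw
bag_probs <- bag_counts / sum(bag_counts)

# Addition rule: the colors are mutually exclusive, so probabilities add
p_green_blue_red <- sum(bag_probs[c("green", "blue", "red")])

# Complement rule: P(NOT green) = 1 - P(green)
p_not_green <- 1 - bag_probs[["green"]]

p_green_blue_red
p_not_green
```

The complement can be checked against the direct route: adding up the probabilities of every color except green should give the same value.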

Sampling with replacement

I draw one M&M from my bag, put it back, then draw another.

  • What’s the P(green and then blue)?
  • What’s the P(green or blue)?
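
For the first question, a minimal sketch: when the first M&M is put back, the two draws are independent, so the probabilities multiply. The counts below are hypothetical, not from a real bag.

```r
# Hypothetical counts (replace with your own bag)
n_green <- 10
n_blue  <- 12
n_total <- 60

# With replacement, the bag is unchanged for the second draw
p_green <- n_green / n_total
p_blue  <- n_blue / n_total

# P(green and then blue) = P(green) * P(blue)
p_green_then_blue <- p_green * p_blue
p_green_then_blue
```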

Sampling without replacement

  • What is the probability of getting green, then blue, without replacing your first draw?
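
A sketch of the same calculation without replacement, again with hypothetical counts. The only change is that the second draw comes from a bag with one fewer M&M in it.

```r
# Hypothetical counts (replace with your own bag)
n_green <- 10
n_blue  <- 12
n_total <- 60

# First draw: green, from the full bag
p_green_first <- n_green / n_total

# Second draw: blue, from the remaining n_total - 1 M&Ms
# (the number of blue M&Ms is unchanged because a green one was removed)
p_blue_second <- n_blue / (n_total - 1)

p_green_then_blue_norepl <- p_green_first * p_blue_second
p_green_then_blue_norepl
```

With a bag this size the result is only slightly larger than the with-replacement value; the difference grows as the bag gets smaller.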

Challenge

What is the sample space when drawing two M&Ms?
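
One way to enumerate the possibilities in R is sketched below. It assumes that the order of the two draws matters and that the bag holds at least two M&Ms of every color, so that pairs such as red then red are possible. The name sample_space is chosen here for illustration.

```r
mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")

# Every ordered pair of (first draw, second draw)
sample_space <- expand.grid(first = mm_colors,
                            second = mm_colors,
                            stringsAsFactors = FALSE)

nrow(sample_space)  # 6 colors x 6 colors = 36 ordered pairs
head(sample_space)
```

If order does not matter, pairs such as (blue, red) and (red, blue) count as the same outcome, and the sample space is smaller.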

Draw a random bag of M&Ms using the company stated frequencies/probabilities for each color

Challenge

How would you create a random bag of M&Ms, assuming that each M&M has an equal probability of being in any given bag?

Mars claims that the colors are not produced in equal proportions; the stated percentage differs slightly from color to color.

## Colors as a vector
mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")
mm_probs <- c(.23, .14, .16, .20, .13, .14)

## I want to "sample" a bag of MMs
new_bag <- sample(x = mm_colors, size = 15, replace = TRUE, prob = mm_probs)
table(new_bag)
## new_bag
##   blue  brown  green orange yellow 
##      2      1      2      7      3

Because it is possible to draw a bag that is completely missing some of the colors, we need to explicitly check how many of each color are in the new bag if we want to compare it with our original bag.

## Count the number of each color
new_bag_counts <- c(sum(new_bag == "blue"),
                    sum(new_bag == "brown"),
                    sum(new_bag == "green"),
                    sum(new_bag == "orange"),
                    sum(new_bag == "red"),
                    sum(new_bag == "yellow"))
new_bag_counts
## [1] 2 1 2 7 0 3
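
A more compact alternative, offered as a sketch rather than the method used in class, is to convert the bag to a factor with all six colors as levels. table then reports a zero for any missing color. The vector example_bag is a stand-in built to have the same counts as the output above.

```r
mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")

# Stand-in bag with no red M&Ms, matching the counts shown above
example_bag <- c(rep("blue", 2), "brown", rep("green", 2),
                 rep("orange", 7), rep("yellow", 3))

# Fixing the factor levels keeps colors that have a count of zero
example_counts <- table(factor(example_bag, levels = mm_colors))
example_counts

# Plain vector of counts, in the same order as mm_colors
as.vector(example_counts)  # 2 1 2 7 0 3
```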

To check whether my bag is the same as the new bag, I can run a series of logical tests, one per color.

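## my_orig_bag must hold the color counts from your own bag, in the same
## order as mm_colors. The values below are hypothetical placeholders.
my_orig_bag <- c(3, 2, 2, 4, 1, 3)
individual_color_compare <- new_bag_counts == my_orig_bag

# The bags match if all of these are true
all(individual_color_compare)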