We usually talk about probabilities in terms of events; the probability of event A occurring is written P(A). Probabilities can be between zero and one; if P(A) equals zero, then the event is impossible; if P(A) equals one, then the event is certain. - Quinn and Keough 2002
From the frequentist perspective, probability is the relative frequency of an event over the long term, after many trials.
E.g., throwing a coin. What’s the probability of a ‘heads’? After one trial, two trials, …, 1000 trials.
(head <- runif(1) > 0.5)
## [1] FALSE
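To see the long-run idea in action, here is a minimal sketch that simulates many tosses and tracks the running proportion of heads. The seed, the number of tosses, and the names `n_tosses`, `tosses`, and `running_prop` are assumptions made for illustration.

```r
## Simulate 1000 fair coin tosses (TRUE = heads)
## The seed is fixed only so the run can be repeated
set.seed(42)
n_tosses <- 1000
tosses <- runif(n_tosses) > 0.5

## Running relative frequency of heads after 1, 2, ..., n_tosses trials
running_prop <- cumsum(tosses) / seq_len(n_tosses)

## Proportion of heads after 1, 2, 10, 100, and 1000 trials
running_prop[c(1, 2, 10, 100, 1000)]
```

With only one or two trials the proportion can only be 0, 0.5, or 1; as the number of trials grows it should settle near 0.5.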
For now, all I will say is that there are other perspectives in statistics. We’ll cover a bit of Bayes at another time.
Open your bag(s) and count how many M&Ms you got. Also count how many of each color.
Make a new variable called tot_mm and assign it the number of MMs in your bag.
tot_mm <- 60
Mean number of MMs in a bag
Calculate the mean number of MMs in a bag.
tot_mm_pop_sample <- c(60, 58, 57, 63, 61, 60, 59)
Mean:
\[ \bar{x} = \frac{\sum_{i=1}^n{x_i}}{n} \]
Using the sum function.
(mean_mm_1 <- sum(tot_mm_pop_sample) / length(tot_mm_pop_sample))
## [1] 59.71429
Using the mean function.
(mean_mm_2 <- mean(tot_mm_pop_sample))
## [1] 59.71429
var function
sd function
What do we mean by variation?
In general, we mean something along the lines of ‘the amount of variation around some mean value’.
How might we quantify this?
First calculate distances from the mean for each bag.
(dist_from_mean <- tot_mm_pop_sample - mean_mm_1)
## [1] 0.2857143 -1.7142857 -2.7142857 3.2857143 1.2857143 0.2857143
## [7] -0.7142857
Next, calculate the total distance by summing the distances from the mean. We could also call this the sum of distances.
(sum_dist_from_mean <- sum(dist_from_mean))
## [1] -7.105427e-15
What information does this give us, and why might it not be useful?
First calculate the absolute distances from the mean.
(absdist_from_mean <- abs(tot_mm_pop_sample - mean_mm_1))
## [1] 0.2857143 1.7142857 2.7142857 3.2857143 1.2857143 0.2857143 0.7142857
Next add up the total absolute distances from the mean.
(sum_absdist_from_mean <- sum(absdist_from_mean))
## [1] 10.28571
While the absolute distance from the mean does give us a reasonable measure of variability, because of specific mathematical properties, it’s more convenient to work with squared distances from the mean, leading to the measures of variance and standard deviation.
Challenge
Calculate squared distances from the mean, and sum them to determine the total squared deviation.
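One possible way to approach this challenge, sketched with the example vector of bag totals used earlier. The names `sq_dist_from_mean` and `sum_sq_dist` are assumptions made for illustration.

```r
## Bag totals from the earlier example
tot_mm_pop_sample <- c(60, 58, 57, 63, 61, 60, 59)

## Squared distance of each bag from the mean
sq_dist_from_mean <- (tot_mm_pop_sample - mean(tot_mm_pop_sample))^2

## Total squared deviation (often called the sum of squares)
(sum_sq_dist <- sum(sq_dist_from_mean))
```

Unlike the raw distances, squared distances are never negative, so they cannot cancel each other out when summed.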
Standard deviation can be thought of, loosely, as a typical deviation from the mean; strictly, it is the square root of the variance.
Challenge
Calculate the variance of the number of M&Ms in a bag, considering all of the bags in the class. Do this without using the var function.
Calculate the standard deviation of the number of M&Ms in a bag. Do this without using the sd function.
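A minimal sketch of one way to do this, using the example bag totals rather than real class data. It assumes the sample variance is wanted, so the sum of squares is divided by n - 1, which is the convention R's `var` and `sd` follow. The names `n_bags`, `var_mm`, and `sd_mm` are illustrative.

```r
## Bag totals from the earlier example
tot_mm_pop_sample <- c(60, 58, 57, 63, 61, 60, 59)
n_bags <- length(tot_mm_pop_sample)

## Sum of squared distances from the mean
sum_sq_dist <- sum((tot_mm_pop_sample - mean(tot_mm_pop_sample))^2)

## Sample variance: sum of squares divided by n - 1
(var_mm <- sum_sq_dist / (n_bags - 1))

## Standard deviation: square root of the variance
(sd_mm <- sqrt(var_mm))

## Check the hand calculations against the built-in functions
all.equal(var_mm, var(tot_mm_pop_sample))
all.equal(sd_mm, sd(tot_mm_pop_sample))
```

Taking the square root returns the measure to the original units (M&Ms per bag), which is why the standard deviation is usually easier to interpret than the variance.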
Without looking, you chose one M&M. What colors could you have chosen?
The set of brown, yellow, green, red, orange, blue is the sample space.
Your selection of a single M&M is called an event.
Challenge
Calculate P(brown), P(yellow), etc. for your bag.
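As a sketch of how these probabilities might be computed, the counts below are made up for illustration (they are not data from the class), and `my_bag_counts` and `my_bag_probs` are hypothetical names. Substitute the counts from your own bag.

```r
## Hypothetical color counts for a single bag of 60 M&Ms (made-up numbers)
my_bag_counts <- c(blue = 12, brown = 8, green = 10,
                   orange = 13, red = 7, yellow = 10)

## Probability of drawing each color = count of that color / total count
(my_bag_probs <- my_bag_counts / sum(my_bag_counts))

## The probabilities over the whole sample space should sum to 1
sum(my_bag_probs)
```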
data.frame
Here we will collate the class's data into a single data set.
Challenge
What is the mean probability of getting each color, averaged across all bags?
Tools to use:
data.frames
apply
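The sketch below shows how a data.frame and apply could be combined for this challenge. The three bags are invented for illustration, with one row per bag and one column per color; `class_bags`, `bag_probs`, and `mean_probs` are hypothetical names.

```r
## Hypothetical class data: one row per bag, one column per color (made-up counts)
class_bags <- data.frame(
  blue   = c(12, 14, 10),
  brown  = c(8, 7, 9),
  green  = c(10, 9, 11),
  orange = c(13, 12, 14),
  red    = c(7, 8, 6),
  yellow = c(10, 8, 7)
)

## Convert each bag's counts to probabilities (apply over rows, MARGIN = 1)
## apply returns one column per bag here, so transpose back to one row per bag
bag_probs <- t(apply(class_bags, 1, function(counts) counts / sum(counts)))

## Mean probability of each color across bags (apply over columns, MARGIN = 2)
(mean_probs <- apply(bag_probs, 2, mean))
```

Averaging the per-bag probabilities gives every bag equal weight regardless of its size, which is not quite the same as pooling all the M&Ms and computing one overall proportion.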
P(green) + P(blue) + P(red)
\[ P(\sim Green) = P(Green^c) = 1 - P(Green) \]
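Both rules can be checked numerically. The sketch below takes the color proportions as assumed values (they match the Mars figures quoted later in this lesson); `p_green_blue_red` and `p_not_green` are illustrative names.

```r
## Assumed color probabilities (the Mars figures used later in this lesson)
mm_probs <- c(blue = .23, brown = .14, green = .16,
              orange = .20, red = .13, yellow = .14)

## Addition rule: a single M&M cannot be two colors at once,
## so P(green or blue or red) is the sum of the three probabilities
(p_green_blue_red <- sum(mm_probs[c("green", "blue", "red")]))

## Complement rule: P(not green) = 1 - P(green)
(p_not_green <- 1 - mm_probs[["green"]])

## Same answer from adding up every color except green
sum(mm_probs[names(mm_probs) != "green"])
```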
I draw one M&M from my bag, put it back, then draw another.
Challenge
What is the sample space when drawing two M&Ms?
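One way to enumerate this sample space is with `expand.grid`, sketched below. It assumes the draws are made with replacement and that order matters (blue then red is a different outcome from red then blue); `two_draws` is an illustrative name.

```r
## The six possible colors for a single draw
mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")

## Every ordered pair of (first draw, second draw)
two_draws <- expand.grid(first = mm_colors, second = mm_colors,
                         stringsAsFactors = FALSE)

## Number of outcomes in the sample space: 6 * 6
nrow(two_draws)

## A peek at the first few outcomes
head(two_draws)
```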
Challenge
How would you create a random bag of M&Ms, assuming that each M&M has an equal probability of being in any given bag?
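A possible sketch using `sample`, which draws each color with equal probability when no `prob` argument is given. The bag size of 60 and the name `random_bag` are assumptions for illustration; the seed is set only so the example can be repeated.

```r
## Seed fixed only for repeatability
set.seed(1)

mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")

## Draw 60 M&Ms with replacement; with no prob argument, all colors are equally likely
random_bag <- sample(x = mm_colors, size = 60, replace = TRUE)

## Tally the colors in the simulated bag
table(random_bag)
```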
Mars claims that the percentages of each color of M&M differ slightly.
## Colors as a vector
mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")
mm_probs <- c(.23, .14, .16, .20, .13, .14)
## I want to "sample" a bag of MMs
new_bag <- sample(x = mm_colors, size = 15, replace = TRUE, prob = mm_probs)
table(new_bag)
## new_bag
## blue brown green orange yellow
## 2 1 2 7 3
Because it is possible to draw a bag that is completely missing some of the colors, we need to explicitly check how many of each color are in the new bag if we want to compare it with our original bag.
## Count the number of each color
new_bag_counts <- c(sum(new_bag == "blue"),
                    sum(new_bag == "brown"),
                    sum(new_bag == "green"),
                    sum(new_bag == "orange"),
                    sum(new_bag == "red"),
                    sum(new_bag == "yellow"))
new_bag_counts
## [1] 2 1 2 7 0 3
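A more compact alternative, offered as a sketch rather than a replacement for the approach above: converting the bag to a factor with all six colors as levels makes `table` report a zero for any missing color. The seed here is arbitrary, so the simulated bag will differ from the one shown above.

```r
## Seed is arbitrary; this bag will not match the one printed above
set.seed(2024)

mm_colors <- c("blue", "brown", "green", "orange", "red", "yellow")
mm_probs <- c(.23, .14, .16, .20, .13, .14)
new_bag <- sample(x = mm_colors, size = 15, replace = TRUE, prob = mm_probs)

## Listing every color as a factor level keeps colors with a count of zero
(new_bag_counts <- table(factor(new_bag, levels = mm_colors)))
```

The counts come back in the order given by `levels`, so they line up with `mm_colors` without any manual bookkeeping.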
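Now, to check whether my bag is the same as the new bag, I can look at a series of logical tests.
## my_orig_bag is assumed to hold the original bag's counts, in the same
## color order as new_bag_counts (blue, brown, green, orange, red, yellow)
individual_color_compare <- new_bag_counts == my_orig_bag
# The bags match if all of these are true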
all(individual_color_compare)