**What is the probability that a red fox wins x bouts?**
Recall problem set 3 (i.e., the red fox versus arctic fox simulation). What if we wanted to know the probability of a red fox winning 60 out of 100 bouts? How might we figure that out? Let’s start with a modified portion of the solution to problem set 3. Specifically, let’s simulate the number of bouts a red fox wins, given some number of total bouts.
## Set the probability that a red fox wins the bout
prob_redfox_win <- 0.6
## Set the number of bouts
bouts <- 100
## Initiate the number of redfox bout wins
redfox_bout_win <- 0
## Start a `for` loop, to replicate a known number of bouts
for(ind in 1:bouts){
  ## Draw a random number to simulate the outcome of the bout
  bout_outcome <- runif(1)
  ## Use an if statement to determine who won
  if(bout_outcome < prob_redfox_win){ # red fox wins
    ## Print the outcome to the console, so we know who won
    print("red fox wins")
    ## Increase the number of redfox_bout_win by 1
    redfox_bout_win <- redfox_bout_win + 1
  } else { # arctic fox wins
    ## Print the outcome to the console
    print("arctic fox wins")
    ## NOTE: We're only interested in the number of redfox wins, so
    ## we're not going to bother keeping track of the arctic fox wins.
  }
} # End the `for` loop
print(paste("Red fox won", redfox_bout_win, "bouts!"))
Next, let’s run the simulation multiple times and monitor the different values for the number of red fox bout wins.
## Set the probability that a red fox wins the bout
prob_redfox_win <- 0.6
## Set the number of replicates
reps <- 100
## Initiate a vector to store the number of red fox wins for each replicate
redfox_bout_wins_vector <- c()
## Start a for loop to go through the desired number of replicates
for(rep in 1:reps){
  ## Set the number of bouts
  bouts <- 100
  ## Initiate the number of redfox bout wins
  redfox_bout_win <- 0
  ## Start a for loop to go through the number of bouts
  for(ind in 1:bouts){
    ## Draw a random number to simulate the outcome of the bout
    bout_outcome <- runif(1)
    ## Use an if statement to determine who won
    if(bout_outcome < prob_redfox_win){ # red fox wins
      ## Print statement now commented out, to avoid flooding the console
      #print("red fox wins")
      ## Increase the number of redfox_bout_win by 1
      redfox_bout_win <- redfox_bout_win + 1
    } else { # arctic fox wins
      ## Print statement now commented out, to avoid flooding the console
      #print("arctic fox wins")
      ## NOTE: We're only interested in the number of redfox wins, so
      ## we're not going to bother keeping track of the arctic fox wins.
    }
  } # End the inner `for` loop over bouts
  print(paste("Red fox won", redfox_bout_win, "bouts!"))
  ## Add the number of redfox_bout_win to redfox_bout_wins_vector
  redfox_bout_wins_vector <- c(redfox_bout_wins_vector, redfox_bout_win)
}
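As an aside, the same simulation can be written much more compactly with vectorized R. This is just a sketch, not part of the original problem set solution: each bout is one comparison of a uniform draw against `prob_redfox_win`, so a whole replicate is a sum over a vector of draws.
## One replicate: count how many of the 100 uniform draws fall below 0.6
sum(runif(bouts) < prob_redfox_win)
## All replicates at once, using replicate()
redfox_bout_wins_vector <- replicate(reps, sum(runif(bouts) < prob_redfox_win))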
How might we visualize the number of red fox bout wins?
library(ggplot2)
ggplot() +
  geom_histogram(data = NULL,
                 aes(x = redfox_bout_wins_vector),
                 binwidth = 1) +
  theme_bw()
How might we visualize the probability for the numbers of red fox bout wins?
library(ggplot2)
ggplot() +
  geom_histogram(data = NULL,
                 aes(x = redfox_bout_wins_vector, y = ..density..),
                 binwidth = 1) +
  theme_bw()
Using our last plot, estimate the probability of the red fox winning exactly 60 bouts.
Thinking about probability and counting (see Meeting 4), let’s determine the exact probability of the red fox winning exactly 60 bouts. First, the probability of any one particular sequence of 60 wins and 40 losses is:

\[ P(\text{red fox wins})^{60} \cdot P(\text{red fox loses})^{40} \]
Think about the result of each bout as an independent event. How many ways can you line up 100 independent events?
\[ n \cdot (n-1) \cdot (n-2) \cdot \ldots \cdot 1 = n! \]

where \(n = 100\).
Next, notice that within any one lineup, the 60 wins can be rearranged among themselves, and likewise the 40 losses, without changing the sequence of outcomes. The number of such rearrangements is:

\[ x! \cdot (n-x)! \]

where \(x = 60\) is the number of wins. Dividing the total number of orderings by this overcounting factor, and multiplying by the probability of any one sequence, gives:

\[ \frac{n!}{x! \cdot (n-x)!} P(\text{red fox wins})^{x} \cdot P(\text{red fox loses})^{n-x} \]
(factorial(100)/(factorial(60)*factorial(100-60))) * 0.6^60 * 0.4^(100-60)
## [1] 0.08121914
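As a sanity check, R’s built-in Binomial density function returns the same value, and `choose()` computes the binomial coefficient \(n!/(x!(n-x)!)\) for us:
## Probability of exactly 60 wins in 100 bouts, with win probability 0.6
dbinom(x = 60, size = 100, prob = 0.6)
## Equivalent, using choose() for the binomial coefficient
choose(100, 60) * 0.6^60 * 0.4^(100-60)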
We just derived the binomial probability distribution. In fact, the equation above is what’s called the probability density function for the Binomial distribution. The Binomial distribution can be thought of as the sum of \(n\) Bernoulli distributions (see below), all with the same parameterization (i.e., probability of success = \(p\)).
This is useful when we want to find the probability of getting a certain number of successes if we repeat some experiment many times.
\[ {pdf} = P(x | N,h) = \binom{N}{x} h^x (1-h)^{N-x} \]
Question: What is the number of successes, \(x\), in \(N\) trials, where the probability of a success is \(h\)?
The Bernoulli distribution is a very simple distribution that can be used if we have a single event (or experiment) that has two possible outcomes, governed by some probability. For example, the probability of the outcome of a single coin toss with a fair coin can be described using a Bernoulli distribution.
\[ {pdf} = P(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases} \]

where \(x\) is the event outcome (i.e., heads or tails).
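For example, here is a minimal sketch of Bernoulli draws in R, using the fact that a Bernoulli trial is just a Binomial trial with `size = 1`:
## Ten fair coin tosses: 1 = heads, 0 = tails
coin_tosses <- rbinom(n = 10, size = 1, prob = 0.5)
coin_tosses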
Let’s look at R’s internal Binomial distribution function.
?rbinom
new_data <- rbinom(n = 100, size = 100, prob = 0.6)
ggplot() +
  geom_histogram(data = NULL,
                 aes(x = new_data, y = ..density..),
                 binwidth = 1) +
  theme_bw()
What happens to the histogram if you increase the number of observations, i.e., `n`?
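One way to explore this (a quick sketch; `more_data` is just a name for the larger sample):
## Draw many more observations from the same Binomial and re-plot
more_data <- rbinom(n = 10000, size = 100, prob = 0.6)
ggplot() +
  geom_histogram(data = NULL,
                 aes(x = more_data, y = ..density..),
                 binwidth = 1) +
  theme_bw()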
Fact: there is a lot of uncertainty associated with biological data.
There are two types of uncertainty:
Process uncertainty: A result of our imperfect knowledge of things. Example: we may get two different mean values for petal length for a particular Iris species if we go out to a field and measure two different sets of 50 flowers.
Observation uncertainty: Inaccuracies that arise during measurement. Example: our petal length values may be “off” if we used two different rulers that were not exactly the same.
In this class, we will focus more on process uncertainty than on observation uncertainty. We try to understand uncertainty and uncertain outcomes by using probability distributions.
Let’s go back to the plots of petal length from last class. Recall that for density plots, the area under the curve (or the total area of the bins) is equal to 1.
ggplot() +
  geom_histogram(data = iris,
                 aes(x = Petal.Length, y = ..density.., fill = Species)) +
  facet_grid( Species ~ . ) +
  geom_density(data = iris, aes(x = Petal.Length, colour = Species)) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Key idea: We can think of the area for a particular bin as the probability of getting a value that falls into that bin.
There are many defined probability distributions that have specific properties. Above we derived the Binomial distribution and were introduced to the Bernoulli distribution. Before looking at more of them, keep the following in mind:
A probability distribution is simply a way to describe the probability that some event occurs, given a set of rules (i.e., a function). Mathematically we can write this as:
\[ P( a < x < b) = \int_a^b f(x|params)dx \]
Go back to our empirical probability distribution for red fox bout wins. What’s the probability that the red fox wins between 58 and 62 bouts?
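One way to answer this, both empirically (from our simulated replicates, assuming `redfox_bout_wins_vector` is still in your workspace) and exactly (with `dbinom()`):
## Proportion of replicates with between 58 and 62 wins (inclusive)
mean(redfox_bout_wins_vector >= 58 & redfox_bout_wins_vector <= 62)
## The exact Binomial probability, summed over 58 to 62 wins
sum(dbinom(x = 58:62, size = 100, prob = 0.6))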
The normal distribution is perhaps the most widely used distribution in the life sciences. It is also probably the most familiar.
The probability density function for the normal is:
\[ f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Let’s dive into working with the Normal in R.
In R, there are several things you can do with all kinds of distributions (not just the normal). Start by looking for help on `rnorm`:
?rnorm
Note that there are `dnorm`, `pnorm`, `qnorm`, and `rnorm`. For now, know that `rnorm` generates random variables from the distribution.
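Here is a quick sketch of what the other three return, for a standard normal:
## Density (height of the pdf) at x = 0
dnorm(x = 0, mean = 0, sd = 1)
## Cumulative probability, P(X <= 1.96)
pnorm(q = 1.96, mean = 0, sd = 1)
## Quantile: the value below which 97.5% of the probability lies
qnorm(p = 0.975, mean = 0, sd = 1)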
normal_rvs <- rnorm(n = 100, mean = 0, sd = 1)
ggplot() +
  geom_histogram( data = NULL, aes(x = normal_rvs) ) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Try changing the `mean` and `sd` parameters and watch how the center and spread of the histogram change.
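For example (one possible parameterization; `shifted_rvs` is just an illustrative name):
## Normal draws with a shifted mean and a larger spread
shifted_rvs <- rnorm(n = 100, mean = 5, sd = 2)
ggplot() +
  geom_histogram( data = NULL, aes(x = shifted_rvs) ) +
  theme_bw()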
The coin flip problem from Problem 2 in Problem Set 2 was actually an example of simulating a process that is Negative Binomially distributed. This distribution is related to the Binomial distribution, but whereas for the Binomial we set the number of trials, \(N\), as fixed and ask about the number of successes, \(x\), for the Negative Binomial we set the number of successes, \(x\), as fixed and ask how many trials, \(N\), it takes to get that number of successes.
\[ {pdf} = P(N|x,h) = \binom{N-1}{x-1} h^x (1-h)^{N-x} \]
Question: What is the number of trials, \(N\), needed to reach the \(x\)th success?
head_count <- 0
tail_count <- 0
## Keep flipping a fair coin until we reach 100 heads
while(head_count < 100){
  if(runif(1) < 0.5){
    head_count <- head_count + 1
  } else {
    tail_count <- tail_count + 1
  }
}
## Total number of flips it took
(flip_count <- head_count + tail_count)
## [1] 239
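R’s built-in `rnbinom()` simulates this same process directly. Note that R parameterizes the Negative Binomial as the number of failures before the target number of successes, so we add the successes back in to get the total number of trials:
## Number of tails before the 100th head, for a fair coin
failures <- rnbinom(n = 1, size = 100, prob = 0.5)
## Total flips = failures plus the 100 successes
failures + 100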
The Geometric distribution is a special case of the Negative Binomial, where we are asking specifically about the number of trials needed to reach the first success (i.e., the case of the Negative Binomial where \(x = 1\)).
\[ {pdf} = P(N|h) = h (1-h)^{N-1} \]
Question: What is the number of trials, \(N\), needed to reach the 1st success?
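As with the Negative Binomial, R’s `rgeom()` returns the number of failures before the first success, so add 1 to count the successful trial itself:
## Ten simulated waits for the first head of a fair coin, in total flips
rgeom(n = 10, prob = 0.5) + 1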
There are many questions we may ask that are related to the Poisson distribution. Usually we think of the Poisson when we have some process that results in a small number most of the time and produces larger numbers less frequently. Think about the number of eggs produced by some bird, or the number of offspring for some animal. By substituting the usual \(\lambda\) value in the Poisson with \(\lambda T\), where \(T\) is some defined time period and \(\lambda\) is some rate value, we can use the Poisson to address questions concerning the number of successes in some time period \(T\).
\[ {pdf} = P(x|T,\lambda) = \frac{ e^{-(\lambda T)} (\lambda T)^x }{ x! } \]
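For example, a sketch with a hypothetical rate of \(\lambda = 2\) eggs per season:
## Simulate clutch sizes for 100 birds, averaging 2 eggs each
clutch_sizes <- rpois(n = 100, lambda = 2)
## Exact probability of observing exactly 4 eggs
dpois(x = 4, lambda = 2)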