The return of the red fox!
Probability distributions - helping us deal with uncertainty in biology
Important Statistical Distributions

The return of the red fox!

** What is the probability that a red fox wins x bouts**

Recall problem set 3 (i.e., the red fox versus arctic fox simulation). What if we wanted to know what the probability of a red fox winning 60 out of 100 bouts is? How might we figure that out? Let’s start by using a modified portion of the solution to problem set 3. Specifically, let’s simulate the number of bouts a red fox wins, given some number of total bouts.

Given the stated probability of a red fox winning, how many bouts do you think the red fox shoud win?

## Start a `for` loop, to replicate a known number of bouts

## Set the probability that a red fox wins the bout
prob_redfox_win <- 0.6

## Set the number of bouts
bouts <- 100 

## Initiate the number of redfox bout wins
redfox_bout_win <- 0


for(ind in 1:bouts){
  

  ## Draw a random number to simulate the outcome of the bout
  bout_outcome <- runif(1)
  
  ## Use an if statement to determine who won
  if(bout_outcome < prob_redfox_win){ # red fox wins
    ## Print the outcome to the console, so we know who won
    print("red fox wins")
    
    ## Increase the number of redfox_bout_win by 1
    redfox_bout_win <- redfox_bout_win + 1
    
  } else { # arctic fox wins
    ## Print the outcome to the console
    print("arctic fox wins")
    
    ## NOTE: We're only interested in the number of redfox wins, so 
    ## we're not going to bother keeping track of the arctic fox wins.
  }  
} # End the `for` loop

print(paste("Red fox won", redfox_bout_win, "bouts!"))

How does your guess from the point above compare to the results of the simulation?

Next, let’s run the simulation multiple times and monitor the different values for the number of red fox bout wins.

## Set the probability that a red fox wins the bout
prob_redfox_win <- 0.6

## Set the number of replicates
reps <- 100

## Initiate a vector to store the number of red fox wins for each replicate
redfox_bout_wins_vector <- c()

## Start a for loop to go through the desired number of replicates
for(rep in 1:reps){ 
  
  ## Set the number of bouts
  bouts <- 100 
  
  ## Initiate the number of redfox bout wins
  redfox_bout_win <- 0
  
  ## Start for loop to go through the number of bouts
  for(ind in 1:bouts){
    
    
    ## Draw a random number to simulate the outcome of the bout
    bout_outcome <- runif(1)
    
    ## Use an if statement to determine who won
    if(bout_outcome < prob_redfox_win){ # red fox wins
      ## Print the outcome to the console, so we know who won
      #print("red fox wins")
      
      ## Increase the number of redfox_bout_win by 1
      redfox_bout_win <- redfox_bout_win + 1
      
      } else { # arctic fox wins
        ## Print the outcome to the console
        #print("arctic fox wins")
        
        ## NOTE: We're only interested in the number of redfox wins, so 
        ## we're not going to bother keeping track of the arctic fox wins.
        }  
    } # End the `for` loop
  
  print(paste("Red fox won", redfox_bout_win, "bouts!"))
  
  ## Add the number of redfox_bout_win to redfox_bout_wins_vector
  redfox_bout_wins_vector <- c(redfox_bout_wins_vector, redfox_bout_win)
  
}

Challenge

How might we visualize the number of red fox bout wins?

library(ggplot2)

ggplot() +
  geom_histogram(data = NULL, 
                 aes(x = redfox_bout_wins_vector), 
                 binwidth = 1) +
  theme_bw()

Probability distribution of red fox wins - simulated

How might we visualize the probability for the numbers of red fox bout wins?

library(ggplot2)

ggplot() +
  geom_histogram(data = NULL, 
                 aes(x = redfox_bout_wins_vector, y = ..density.. ), 
                 binwidth = 1) +
  theme_bw()

Challenge

Using our last plot, estimate the probability of the red fox winning exactly 60 bouts.

Probability of 60 wins - analytically

Thinking about probability and counting (see Meeting 4), let’s determine the exact probability of the redfox winning exactly 60 bouts.

What’s the probability of a redfox winning 60 bouts, then losing 40?

\[ P(\text{red fox wins})^{60} * P(\text{red fox loses})^{40} \]

How many different ways can there be 60 wins?

Think about the result of each bout as an indpendent event. How many ways can you line up 100 indpendent events?

\[ n \cdot (n-1) \cdot (n-2) \cdot \text{...} \cdot 1 = n! \]

, where \(n = 100\).

Next, how many ways can you line up 60 wins and 40 losses?

\[ x! \cdot (n-x)! \]

, where \(x = 60\), or the number of wins.

Putting this all together we have the following:

\[ \frac{n!}{x! \cdot (n-x)!} P(\text{red fox wins})^{x} * P(\text{red fox loses})^{n-x} \]

Challange

Plug in the number using R, and calculate the probability of getting exactly 60 wins.

(factorial(100)/(factorial(60)*factorial(100-60))) * 0.6^60 * 0.4^(100-60)

## [1] 0.08121914

How does this compare to your estimate from above?

Binomial probability distribution

We just derived the bionomial probability distribution. In fact, the equation above is what’s called the probability density function for the Bionomial Distribution. The Binomial distribution can be thought of as the sum of \(n\) Bernoulli distributions (see below), all with the same parameterization (i.e., probability of success = \(p\)).
This is useful if we want to find the probability of getting a certain number of successes if your repeat some experiment many times.

\[ {pdf} = P(x | N,h) = \binom{N}{x} h^x (1-h)^{N-x} \]

Question: What is the number of successes, \(x\), in \(N\) trials, where the probability of a success is \(h\)

Bernoulli Distribution:

The Bernoulli distribution is a very simple distribution that can be used if we have a single event (or experiment) that has two possible outcomes, governed by some probability. For example, the probability of the outcome of a single coin toss with a fair coin can be described using a Bernoulli distribution.

\({pdf} = P(x) = p\) if \(x=1\) and \((1-p)\) if \(x=0\), where \(x\) is the event outcome (i.e., heads or tails)

Binomial distribution in R

Let’s look at R’s internal Binomial distribution function.

?rbinom()

new_data <- rbinom(n = 100, size = 100, prob = 0.6)

ggplot() +
  geom_histogram(data = NULL, 
                 aes(x = new_data, y = ..density.. ), 
                 binwidth = 1) +
  theme_bw()

Challenge

What happens to the historgram if you increase the observations, i.e. n?

Probability distributions - helping us deal with uncertainty in biology

Fact: there is a lot of uncertainty associated with biological data.

Uncertainty in biological data

There are two types of uncertainty:

Process uncertainty: A result of our imperfect knowledge of things. Example: we may get two different mean values for petal length for a particular Iris species if we go out to a field and measure two different sets of 50 flowers.
Observation uncertainty: Inaccuracies resulting during measurement. Example: our petal lengths values may be “off” if we used two different rulers, which were not exactly the same.

We will focus more on Process uncertainty, than on Observation uncertainty in this class. We try to understand uncertainty and uncertain outcomes by using probability distributions.

Histograms and density plots as probability distributions

Let’s go back to the plots of petal length from last class. Recall that when considering density plots the area under the curve or the area of the bins is equal to 1.

ggplot() +
  geom_histogram(data = iris, 
                 aes(x = Petal.Length, y = ..density.., fill = Species)) +
  facet_grid( Species ~ . ) +
  geom_density(data = iris, aes(x = Petal.Length, colour = Species)) +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Key idea: We can think of the area for a particular bin as the probability of getting a value that falls into that bin.

Emperical distributions versus defined probability distributions

There are many defined probability distributions that have specific properties. Above we derived the Binomial Distribution and were introduced to the Bernoulli Distribution. Some properties of distributions to keep in mind:

The area under the curve, or cumulative area of the bins is equal to 1
Different values of the variable described by a distribution are on the x-axis
The corresponding probability value for that particular variable value is on the y-axis (or expressed by the total area of the bin in a histogram plot). Note - this is not the case for continuous distributions, which we’ll discuss later.
We often want to focus on the probability distribution(s) that are related to our specific questions. Many of the most common distributions, though not all, can be interpreted as providing the answer to particular questions. Below I have outlined a number of distributions and the questions we might use them to answer.

Generic Distribution:

A probability distribution is simply a way to describe the probability that some event occurs, given a set of rules (i.e., a function). Mathematically we can write this as:

\[ P( a < x < b) = \int_a^b f(x|params)dx \]

Challange

Go back to our emperical probability distribution function for red fox bout wins. What’s the probability that the red fox wins between 58 and 62 bouts?

Important Statistical Distributions

Normal Distribution:

The normal distribution is perhaps the most widely used distribution in life science. It is also probably the most familiar.

The probability density function for the normal is:

\[ f(x|\mu,\sigma) ~ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

Let’s dive into working with the Normal in R.

In R, there are several things you can do with all kinds of distributions (not just the normal). Start by looking for help on rnorm

?rnorm

Note that there is dnorm, pnorm, qnorm, and rnorm. For now, know that rnorm generates random variables from the distribution.

normal_rvs <- rnorm(n = 100, mean = 0, sd = 1)

ggplot() + 
  geom_histogram( data = NULL, aes(x = normal_rvs) ) +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Challenge

Increase the number of random varialbes drawn.
Change the mean and sd parameters.

Negative Binomial Distribution:

The coin flip problem from Problem 2 in Problem Set 2 was actually an example of simulating a process that is Negative Binomially distributed. This distribution is related to the Binomial distribution, but where as for the that distribution we set the number of trials, \(N\), as fixed, and asked about the number of successes, \(x\), in the Negative binomial we set the number of successes as fixed, \(x\), and ask how many trials, \(N\), does it takes to get that number of successes.

\[ {pdf} = P(N|x,h) = \binom{N-1}{x-1} h^x (1-h)^{N-x} \]

Question: What is the number of trials, \(N\), needed to reach the \(x\) th success.

head_count <- 0
tail_count <- 0

while(head_count <= 100){
  ifelse(test = runif(1) < 0.5, yes = (head_count <- head_count + 1), no = (tail_count <- tail_count + 1))
}

(flip_count <- head_count + tail_count)

## [1] 239

Geometric Distribution:

The Geometric distribution is a special case of the Negative Binomial, where we are asking specifically about the number of trials to reach the first success (i.e. the case of the negative binomial where \(x=1\).

\[ {pdf} = P(N|h) = h (1-h)^{N-1} \]

Question: What is the number of trials, \(N\), needed to reach the 1st success.

Poisson Distribution:

There are many questions we may ask that are related to the Poisson distribution. Usually we think of the Poisson when we have some process that usually results in a small number most of the times and produces larger numbers less frequently. Think about the number of eggs produced by some bird, or the number of off spring for some animal. By substituting the ususal \(\lambda\) value in the Poisson with \(\lambda T\), where \(T\) is some defined time period, and \(\lambda\) is some rate value, we can use the Poisson to address questions concerning the number of successes in some time period T.

\[ {pdf} = P(x|T,\lambda) = \frac{ e^{-(\lambda T)} (\lambda T)^x }{ x! } \]

Meeting 6 - Probability Distributions

Matthew E. Aiello-Lammens

The return of the red fox!

Challenge

Probability distribution of red fox wins - simulated

Challenge

Probability of 60 wins - analytically

Challange

Binomial probability distribution

Bernoulli Distribution:

Binomial distribution in R

Challenge

Probability distributions - helping us deal with uncertainty in biology

Uncertainty in biological data

Histograms and density plots as probability distributions

Emperical distributions versus defined probability distributions

Generic Distribution:

Challange

Important Statistical Distributions

Normal Distribution:

Challenge

Negative Binomial Distribution:

Geometric Distribution:

Poisson Distribution: