Section Goals


Meeting preliminaries

We’ll be working with the Portal Survey data this week (again). We must first load these data into our work environment. Also, we’ll be focusing only on the Rodent taxa, so let’s filter the data to include only rodents

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
surveys <- read_csv("../data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <-
  surveys %>%
  filter(taxa == "Rodent")

Challenge - Probability review

  • What are the three main properties of probability (as we’ve discussed in this class)?
  • Define what we mean by mutually exclusive events
  • Describe the Law of Large Numbers and discuss how it relates to the frequency of an event
  • In what context would you use the General Addition Rule? Note - the General Addition Rule is \(P(A\text{ or } B ) = P(A) + P(B) - P(A\text{ and }B)\)

Random Variables

A random variable is a function that maps each event in a sample space to a number.

Suppose \(X\) is the number of heads in 3 tosses of a fair coin.

3 coin tosses
3 coin tosses

Distribution of a Discrete Random Variable

The distribution of a discrete random variable is the collection of its values and the probabilities associated with those values.

The probability distribution for \(X\) is as follows:

\(x_i\) 0 1 2 3
\(P(X = x_i)\) 1/8 3/8 3/8 1/8

\[\sum_{x=0}^3 P(X = x_i) = 1 \]


Histograms and density plots as probability distributions

We can look at a density plot of the frequency of the different genera, which would be a probability distribution for genera type.

ggplot(data = surveys, aes(x = genus, y = ..prop.., group = 1)) + 
  geom_bar() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) 
## Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(prop)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Key idea: We can think of the area for a particular bin as the probability of getting a value that falls into that bin.

Emperical distributions versus defined probability distributions

There are many defined probability distributions that have specific properties. In the online lecture you learned about the binomial, Poisson, and normal distributions. Some properties of distributions to keep in mind:

  • The area under the curve, or cumulative area of the bins is equal to 1
  • Different values of the variable described by a distribution are on the x-axis
  • The corresponding probability value for that particular variable value is on the y-axis (or expressed by the total area of the bin in a histogram plot). Note - this is not the case for continuous distributions, which we’ll discuss later.
  • We often want to focus on the probability distribution(s) that are related to our specific questions. Many of the most common distributions, though not all, can be interpreted as providing the answer to particular questions. Below I have outlined a number of distributions and the questions we might use them to answer.

Binomial Distribution

The Binomial distribution can be thought of as the sum of \(n\) Bernoulli distributions (see below), all with the same parameterization (i.e., probability of success = \(p\)).
This is useful if we want to find the probability of getting a certain number of successes if your repeat some experiment many times.

\[ {pdf} = P(x | N,h) = \binom{N}{x} h^x (1-h)^{N-x} \]

Question: What is the number of successes, \(x\), in \(N\) trials, where the probability of a success is \(h\)

Binomial Distribution shiny app

https://istats.shinyapps.io/BinomialDist/

Bernoulli Distribution:

The Bernoulli distribution is a very simple distribution that can be used if we have a single event (or experiment) that has two possible outcomes, governed by some probability. For example, the probability of the outcome of a single coin toss with a fair coin can be described using a Bernoulli distribution.

\({pdf} = P(x) = p\) if \(x=1\) and \((1-p)\) if \(x=0\), where \(x\) is the event outcome (i.e., heads or tails)

Binomial example with Portal data

Let’s consider the probability of trapping a Dipodomys versus not a Dipodomys.

P(trap a Dipodomys) = 0.4721

Now, let’s look at R’s internal Binomial distribution function.

rbinom(n = 1, size = 100, prob = 0.4721)
## [1] 55
new_data <- rbinom(n = 100, size = 100, prob = 0.4721)

ggplot() +
  geom_histogram(data = NULL, 
                 aes(x = new_data, y = ..density.. ), 
                 binwidth = 1) +
  theme_bw()

Let’s look at this empirically from our data.

How many Dipodomys trappings do we get if we randomly selected 100 of the trapping events?

# select 100 random rows
rand_rows <- sample(1:nrow(surveys), size = 100)

# grab these rows as a subset of the surveys data set
surveys_subset <- surveys[rand_rows, ]

# count the number of dipodomys
sum(surveys_subset$genus == "Dipodomys")
## [1] 46

This is equivalent to a single draw from a Binomial distribution.

Next, let’s try multiple draws from our data set.

# Create an empty vector of dipo_counts
dipo_counts <- c()

for(x in 1:100){
  # select 100 random rows
  rand_rows <- sample(1:nrow(surveys), size = 100)
  
  # grab these rows as a subset of the surveys data set
  surveys_subset <- surveys[rand_rows, ]
  
  # count the number of dipodomys and add this count to 
  # our dipo_counts vector
  dipo_counts <- c(dipo_counts, sum(surveys_subset$genus == "Dipodomys"))
  
}

Now, let’s make a histogram of the dipo_counts data.

ggplot() +
  geom_histogram(data = NULL, 
                 aes(x = dipo_counts, y = ..density.. ), 
                 binwidth = 1) +
  theme_bw()


Challenge - changing n and / or sample size

Using the code above, experiment with what happens when you change values for n and / or sample size. In this case, n was equal to the number of iterations of our for loop and sample size was equal to the number of random rows sampled from the overall data set.


Normal Distribution

The normal distribution is perhaps the most widely used distribution in life science. It is also probably the most familiar.

The probability density function for the normal is:

\[ f(x|\mu,\sigma) \sim \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

Let’s dive into working with the Normal in R.

In R, there are several things you can do with all kinds of distributions (not just the normal). Start by looking for help on rnorm

?rnorm

Note that there is dnorm, pnorm, qnorm, and rnorm. For now, know that rnorm generates random variables from the distribution.

normal_rvs <- rnorm(n = 100, mean = 0, sd = 1)

ggplot() + 
  geom_histogram( data = NULL, aes(x = normal_rvs) ) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Challenge

  1. Increase the number of random variables drawn.
  2. Change the mean and sd parameters.

Normal Distribution shiny app

https://istats.shinyapps.io/NormalDist/

Standard Normal distribution

The Standard Normal distribution is a Normal distribution with \(\mu = 0\) and \(\sigma = 1\).

Normal example with Portal data

Let’s look at the distribution of the weight of captured Chaetodipus.

chaetodipus <-
  surveys %>%
  filter(genus == "Chaetodipus")

ggplot(data = chaetodipus, aes(x = weight, y = ..density..)) +
  geom_histogram() +
  geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).

Data transformation

The observed distribution does not appear to be normally distributed. But we can change the scale of the values so they appear more normal-like. The most common transformation for this is a log transformation, either a \(log_e\) or \(log_{10}\) transformation.

chaetodipus$log_weight <- log10(chaetodipus$weight)

ggplot(data = chaetodipus, aes(x = log_weight, y = ..density..)) +
  geom_histogram() +
  geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).

OK. Much closer. But there does appear to be some bimodality in the distribution. Why? Perhaps this is due to sex differences.

ggplot(data = chaetodipus, aes(x = log_weight, y = ..density.., fill = sex)) +
  geom_histogram(position = "dodge") +
  geom_density(alpha = 0.3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).

There appears to be some differences between sexes, particularly larger females. But there does appear to be two modes even for each sex.

Let’s push on and compare our observations to an assumed normal distribution of weights. To parameterize our distribution, we need to calculate the mean log_weight and standard deviation of log_weight.

mean_chaeto_logweight <- mean(chaetodipus$log_weight, na.rm = TRUE)
sd_chaeto_logweight <- sd(chaetodipus$log_weight, na.rm = TRUE)

ggplot(data = chaetodipus) +
  geom_histogram(aes(x = log_weight, y = ..density..)) +
  geom_density(aes(x = log_weight)) +
  stat_function(fun = dnorm, 
                args = list(mean = mean_chaeto_logweight, sd = sd_chaeto_logweight),
                colour = "blue", alpha = 0.6, size = 2) 
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).

OK, now let’s calculate some frequencies and probabilities, using both the observed data and assuming a normal distribution.

  • What’s the probability of trapping a Chaetodipus greater than the average log(weight), 1.35?

Based on observations - count the number of Chaetodipus trapped that are greater than 1.35 and divide that number by the total number of Chaetodipus trapped.

sum(chaetodipus$log_weight > mean_chaeto_logweight, na.rm = TRUE) / nrow(chaetodipus)
## [1] 0.4572898

Based on assumption of normal distribution - use R’s internal functions.

pnorm(q = mean_chaeto_logweight, 
      mean = mean_chaeto_logweight,
      sd = sd_chaeto_logweight, 
      lower.tail = FALSE)
## [1] 0.5

Challenge - calculate, both ways, the probability of catching a Chaetodipus whose log(weight) is less than 1.


Poisson Distribution

There are many questions we may ask that are related to the Poisson distribution. Usually we think of the Poisson when we have some process that usually results in a small number most of the times and produces larger numbers less frequently. Think about the number of eggs produced by some bird, or the number of off spring for some animal. By substituting the usual \(\lambda\) value in the Poisson with \(\lambda T\), where \(T\) is some defined time period, and \(\lambda\) is some rate value, we can use the Poisson to address questions concerning the number of successes in some time period T.

\[ {pdf} = P(x|T,\lambda) = \frac{ e^{-(\lambda T)} (\lambda T)^x }{ x! } \]

Poisson Distribution shiny app

https://istats.shinyapps.io/PoissonDist/

Case study - rate of catching rock pocket mice (Chaetodipus intermedius)

Calculate the rates (number of captures per year) for all of the species.

surveys %>%
  group_by(genus, species, year) %>%
  count() %>%
  mutate(genus_species = paste(genus, species)) %>%
  group_by(genus_species) %>%
  summarise(avg_cnt = mean(n)) %>%
  arrange(avg_cnt)
## # A tibble: 29 × 2
##    genus_species             avg_cnt
##    <chr>                       <dbl>
##  1 Reithrodontomys sp.          1   
##  2 Spermophilus tereticaudus    1   
##  3 Rodent sp.                   1.43
##  4 Chaetodipus sp.              1.5 
##  5 Onychomys sp.                2   
##  6 Chaetodipus intermedius      2.25
##  7 Dipodomys sp.                2.67
##  8 Perognathus hispidus         3.56
##  9 Reithrodontomys montanus     4   
## 10 Sigmodon fulviventer         7.17
## # ℹ 19 more rows

What is the probability that we will capture exactly 5 rock pocket mice in the coming year’s survey?

dpois(x = 5, lambda = 2.25)
## [1] 0.05064875

What is the probability that we will capture 5 or more rock pocket mice?

ppois(q = 4, lambda = 2.25, lower.tail = FALSE)
## [1] 0.07801411