Section Goals


Meeting preliminaries

We’ll be working with the Portal Survey data this week. We must first load these data into our work environment. Also, we’ll be focusing only on the Rodent taxa, so let’s filter the data to include only rodents

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
surveys <- read_csv("../data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <-
  surveys %>%
  filter(taxa == "Rodent")

Probability

We usually talk about probabilities in terms of events; the probability of event A occurring is written P(A). Probabilities can be between zero and one; if P(A) equals zero, then the event is impossible; if P(A) equals one, then the event is certain. - Quinn and Keough 2002

Frequentist statistics

The relative frequency of an event over the long term, after many trials.

E.g., throwing a coin. What’s the probability of a ‘heads’? After one trial, two trials, …, 1000 trials.

Here’s an example of throwing a single coin.

(head <- runif(1) > 0.5)
## [1] TRUE

Bayesian statistics

For now, all I will say is that there are other perspectives in statistics. A key difference between Frequentist and Bayesian statistics is that the latter allows for you to incorporate prior knowledge.

Flipping a coin - frequentist vs bayesian perspective

You find a coin on the ground, and want to know the probability that you will get a heads if you flip the coin. A frequentist approach would flip the coin many times and assume the probability of getting a heads is approximately the frequency of heads. A Bayesian approach would assume that the probability is 0.5 based on a naive guess, then collect data (i.e., flip the coin) and update the prior probability using the newly acquired data.


Properties of probability

Law of Large Numbers

The tendency of \(\hat{p}_n\) to stabilize around \(p\) [as \(n\) increases] is described by the Law of Large Numbers. - Vu and Harrington 2020, p. 92

Example of law of large numbers

Let’s start by calculating the frequency of each of the genera captures in the Portal traps. These frequency values can be interpreted as probability values of catching a particular genus in a trap.

genus_frequencies <-
  surveys %>%
  group_by(genus) %>%
  count()

genus_frequencies$freq <- round(genus_frequencies$n / sum(genus_frequencies$n), 4)

knitr::kable(genus_frequencies)
genus n freq
Ammospermophilus 437 0.0128
Baiomys 46 0.0013
Chaetodipus 6029 0.1760
Dipodomys 16167 0.4721
Neotoma 1252 0.0366
Onychomys 3267 0.0954
Perognathus 1629 0.0476
Peromyscus 2234 0.0652
Reithrodontomys 2694 0.0787
Rodent 10 0.0003
Sigmodon 233 0.0068
Spermophilus 249 0.0073

The probabilities above are based on full information. What happens if we only has partial information?

First, how can we get a subset of the data?

# Pick the number of rows to select
rows_to_keep <- sample(1:nrow(surveys), size = 100, replace = FALSE)

surveys_subset <- surveys[rows_to_keep, ]

Next let’s calculate the frequencies / probabilities with the subset of data.

gen_freq_sub <-
  surveys_subset %>%
  group_by(genus) %>%
  count()

gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)

gen_freq_sub
## # A tibble: 8 × 3
## # Groups:   genus [8]
##   genus                n  freq
##   <chr>            <int> <dbl>
## 1 Ammospermophilus     4  0.04
## 2 Chaetodipus         16  0.16
## 3 Dipodomys           50  0.5 
## 4 Neotoma              1  0.01
## 5 Onychomys           12  0.12
## 6 Perognathus          5  0.05
## 7 Peromyscus           3  0.03
## 8 Reithrodontomys      9  0.09

Next, let’s play with the numbers of samples, and see what happens to the frequencies compared to the real frequencies.

# Pick the number of rows to select
rows_to_keep <- sample(1:nrow(surveys), size = 1000, replace = FALSE)

surveys_subset <- surveys[rows_to_keep, ]

gen_freq_sub <-
  surveys_subset %>%
  group_by(genus) %>%
  count()

gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)

gen_freq_sub
## # A tibble: 11 × 3
## # Groups:   genus [11]
##    genus                n  freq
##    <chr>            <int> <dbl>
##  1 Ammospermophilus     8 0.008
##  2 Baiomys              4 0.004
##  3 Chaetodipus        152 0.152
##  4 Dipodomys          483 0.483
##  5 Neotoma             30 0.03 
##  6 Onychomys           95 0.095
##  7 Perognathus         43 0.043
##  8 Peromyscus          78 0.078
##  9 Reithrodontomys     97 0.097
## 10 Sigmodon             6 0.006
## 11 Spermophilus         4 0.004

Aside: sampling with vs. without replacement

Above, we set replace = FALSE in the sample function. This means that any row that is selected, can only be selected once.

On the other hand, if we set replace = TRUE, then any row can be selected more than once by random.

# Pick the number of rows to select - WITH REPLACE = TRUE!
rows_to_keep <- sample(1:nrow(surveys), size = 1000, replace = TRUE)

surveys_subset <- surveys[rows_to_keep, ]

gen_freq_sub <-
  surveys_subset %>%
  group_by(genus) %>%
  count()

gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)

gen_freq_sub
## # A tibble: 11 × 3
## # Groups:   genus [11]
##    genus                n  freq
##    <chr>            <int> <dbl>
##  1 Ammospermophilus    15 0.015
##  2 Baiomys              1 0.001
##  3 Chaetodipus        184 0.184
##  4 Dipodomys          465 0.465
##  5 Neotoma             44 0.044
##  6 Onychomys           82 0.082
##  7 Perognathus         51 0.051
##  8 Peromyscus          65 0.065
##  9 Reithrodontomys     82 0.082
## 10 Sigmodon             8 0.008
## 11 Spermophilus         3 0.003

With a sample size of 1000, these values should be pretty similar. But there are reasons why you might choose replace = TRUE. For example, if you are projecting probabilities of capturing genera in the future, then treating past observations as a way to build generic distributions (using replace = TRUE) may be more appropriate.

Specific examples of calculating probabilities

  1. Probability of capturing a Neotoma

P(Neotoma) = 0.0366 = 3.6%

  1. Probability of capturing a Neotoma OR a Dipodomys

These are mutually exclusive events. You cannot be both a Neotoma and a Dipodomys.

P(Neotoma OR Dipodomys) = 0.0366 + 0.4721 = 0.5087

  1. Probability of capturing something other than Onychomys

This is called the complement of Onychomys

P(not Onychomys) = 1 - 0.0954 = 0.9046


Challenge - more with probability

  1. What is the probability of capturing a female rodent?

  2. What is the probability of capturing a female Dipodomys?

P(female AND Dipodomys)

  1. What is the probability of capturing a female rodent OR a Dipodomys?

Note - these are non-independent events. You need to use the General Addition Rule.

\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \]

Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. - Vu and Harrington 2020, p. 101


Challenge

  • Is flipping two coins different from flipping one, then the other?
  • You flip 99 heads. What is the probability you will flip a head on the 100th toss?
  • You catch a Neotoma in a trap in one of the Portal plots. What is the probability that you will catch a Neotoma in a trap in one of the other Portal plots?

Multiplication Rule for Independent Processes

From Vu and Harrington 2020, p. 103

If A and B represent events from two different and independent processes, then the probability that both A and B occur is given by:

\[ P(A \text{ and } B) = P(A) \times P(B) \]

Similarly, if there are \(k\) events \(A_1\), …, \(A_k\) from \(k\) independent processes, then the probability they all occur is:

\[ P(A_1)P(A_2)...P(A_k) \]


Challenge - probability of capturing 100 Dipodomys in a row vs capturing 100 Chaetodipus in a row.

First recall how we calculated the frequencies / probabilities of capturing each of the different genera of rodents. Then use the multiplication rule to calculate the two probabilities above.

genus_frequencies <-
  surveys %>%
  group_by(genus) %>%
  count()

genus_frequencies$freq <- round(genus_frequencies$n / sum(genus_frequencies$n), 4)

knitr::kable(genus_frequencies)
genus n freq
Ammospermophilus 437 0.0128
Baiomys 46 0.0013
Chaetodipus 6029 0.1760
Dipodomys 16167 0.4721
Neotoma 1252 0.0366
Onychomys 3267 0.0954
Perognathus 1629 0.0476
Peromyscus 2234 0.0652
Reithrodontomys 2694 0.0787
Rodent 10 0.0003
Sigmodon 233 0.0068
Spermophilus 249 0.0073

Conditional Probability

Conditional Probability: Intuition

Consider height in the US population.

What is the probability that a randomly selected individual in the population is taller than 6 feet, 4 inches?

  • Suppose you learn that the individual is a professional basketball player. Does this change the probability that the individual is taller than 6 feet, 4 inches?

Conditional Probability: Concept

The conditional probability of an event \(A\), given a second event \(B\), is the probability of \(A\) happening, knowing that the event \(B\) has happened.

  • This conditional probability is denoted \(P(A|B)\)

What is the probability of trapping a Dipodomys merriami?

Let \(A\) = capture a Dipodomys merriami

species_frequencies <-
  surveys %>%
  group_by(genus, species) %>%
  count()

species_frequencies <-
  species_frequencies %>%
  mutate(freq = round(n / sum(species_frequencies$n), 4))

What is the probability of trapping a Dipodomys merriami, conditional (or given) that you captures a Dipodomys?

Again, let \(A\) = capture a Dipodomys merriami, and now \(B\) = capture a Dipodomys.

dipo_frequencies <-
  surveys %>%
  filter(genus == "Dipodomys") %>%
  group_by(species) %>%
  count()

dipo_frequencies <-
  dipo_frequencies %>%
  mutate(freq = round(n / sum(dipo_frequencies$n), 4))

dipo_frequencies
## # A tibble: 4 × 3
## # Groups:   species [4]
##   species         n   freq
##   <chr>       <int>  <dbl>
## 1 merriami    10596 0.655 
## 2 ordii        3027 0.187 
## 3 sp.            40 0.0025
## 4 spectabilis  2504 0.155

Conditional Probability: Formal Definition

As long as \(P(B) > 0\), \[P(A|B) = \dfrac{P(A \cap B)}{P(B)}. \]

From the definition, \[\begin{align*} P(A|B) =& \dfrac{P(\text{capture a Dipodomys merriami AND capture a Dipodomys})}{P(\text{capture a Dipodomys})} \\ =& \dfrac{0.3094}{0.4721} = 0.6554 \end{align*}\]

Independence, Again…

A consequence of the definition of conditional probability:

  • If \(P(A|B) = P(A)\), then \(A\) and \(B\) are independent; knowing \(B\) offers no information about whether \(A\) occurred.

General Multiplication Rule

If \(A\) and \(B\) represent two outcomes or events, then \[P(A \cap B) = P(A|B)P(B).\]

This follows from rearranging the definition of conditional probability: \[P(A|B) = \frac{P(A \cap B)}{P(B)} \rightarrow P(A|B)P(B) = P(A \cap B)\]

Unlike the previously mentioned multiplication rule, this is valid for events that might not be independent.

Bayes’ Theorem, aka Bayes’ Rule

Bayes’ Thorem (simplest form):

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

Follows directly from the definition of conditional probability by noting that \(P(A) P(B|A)\) equals \(P(A \text{ and } B)\):

\[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\]

Foundation of Bayesian statistics:

\[P(\text{assumptions}|\text{data}) = \frac{P(\text{data}|\text{assumptions})P(\text{assumptions})}{P(\text{data})}\]