We’ll be working with the Portal Survey data this week. We must first load these data into our work environment. Also, we’ll be focusing only on the Rodent taxa, so let’s filter the data to include only rodents
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
surveys <- read_csv("../data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <-
surveys %>%
filter(taxa == "Rodent")
We usually talk about probabilities in terms of events; the probability of event A occurring is written P(A). Probabilities can be between zero and one; if P(A) equals zero, then the event is impossible; if P(A) equals one, then the event is certain. - Quinn and Keough 2002
The relative frequency of an event over the long term, after many trials.
E.g., throwing a coin. What’s the probability of a ‘heads’? After one trial, two trials, …, 1000 trials.
Here’s an example of throwing a single coin.
(head <- runif(1) > 0.5)
## [1] TRUE
For now, all I will say is that there are other perspectives in statistics. A key difference between Frequentist and Bayesian statistics is that the latter allows for you to incorporate prior knowledge.
You find a coin on the ground, and want to know the probability that you will get a heads if you flip the coin. A frequentist approach would flip the coin many times and assume the probability of getting a heads is approximately the frequency of heads. A Bayesian approach would assume that the probability is 0.5 based on a naive guess, then collect data (i.e., flip the coin) and update the prior probability using the newly acquired data.
The tendency of \(\hat{p}_n\) to stabilize around \(p\) [as \(n\) increases] is described by the Law of Large Numbers. - Vu and Harrington 2020, p. 92
Let’s start by calculating the frequency of each of the genera captures in the Portal traps. These frequency values can be interpreted as probability values of catching a particular genus in a trap.
genus_frequencies <-
surveys %>%
group_by(genus) %>%
count()
genus_frequencies$freq <- round(genus_frequencies$n / sum(genus_frequencies$n), 4)
knitr::kable(genus_frequencies)
genus | n | freq |
---|---|---|
Ammospermophilus | 437 | 0.0128 |
Baiomys | 46 | 0.0013 |
Chaetodipus | 6029 | 0.1760 |
Dipodomys | 16167 | 0.4721 |
Neotoma | 1252 | 0.0366 |
Onychomys | 3267 | 0.0954 |
Perognathus | 1629 | 0.0476 |
Peromyscus | 2234 | 0.0652 |
Reithrodontomys | 2694 | 0.0787 |
Rodent | 10 | 0.0003 |
Sigmodon | 233 | 0.0068 |
Spermophilus | 249 | 0.0073 |
The probabilities above are based on full information. What happens if we only has partial information?
First, how can we get a subset of the data?
# Pick the number of rows to select
rows_to_keep <- sample(1:nrow(surveys), size = 100, replace = FALSE)
surveys_subset <- surveys[rows_to_keep, ]
Next let’s calculate the frequencies / probabilities with the subset of data.
gen_freq_sub <-
surveys_subset %>%
group_by(genus) %>%
count()
gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)
gen_freq_sub
## # A tibble: 8 × 3
## # Groups: genus [8]
## genus n freq
## <chr> <int> <dbl>
## 1 Ammospermophilus 4 0.04
## 2 Chaetodipus 16 0.16
## 3 Dipodomys 50 0.5
## 4 Neotoma 1 0.01
## 5 Onychomys 12 0.12
## 6 Perognathus 5 0.05
## 7 Peromyscus 3 0.03
## 8 Reithrodontomys 9 0.09
Next, let’s play with the numbers of samples, and see what happens to the frequencies compared to the real frequencies.
# Pick the number of rows to select
rows_to_keep <- sample(1:nrow(surveys), size = 1000, replace = FALSE)
surveys_subset <- surveys[rows_to_keep, ]
gen_freq_sub <-
surveys_subset %>%
group_by(genus) %>%
count()
gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)
gen_freq_sub
## # A tibble: 11 × 3
## # Groups: genus [11]
## genus n freq
## <chr> <int> <dbl>
## 1 Ammospermophilus 8 0.008
## 2 Baiomys 4 0.004
## 3 Chaetodipus 152 0.152
## 4 Dipodomys 483 0.483
## 5 Neotoma 30 0.03
## 6 Onychomys 95 0.095
## 7 Perognathus 43 0.043
## 8 Peromyscus 78 0.078
## 9 Reithrodontomys 97 0.097
## 10 Sigmodon 6 0.006
## 11 Spermophilus 4 0.004
Above, we set replace = FALSE
in the sample
function. This means that any row that is selected, can only be selected
once.
On the other hand, if we set replace = TRUE
, then any
row can be selected more than once by random.
# Pick the number of rows to select - WITH REPLACE = TRUE!
rows_to_keep <- sample(1:nrow(surveys), size = 1000, replace = TRUE)
surveys_subset <- surveys[rows_to_keep, ]
gen_freq_sub <-
surveys_subset %>%
group_by(genus) %>%
count()
gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)
gen_freq_sub
## # A tibble: 11 × 3
## # Groups: genus [11]
## genus n freq
## <chr> <int> <dbl>
## 1 Ammospermophilus 15 0.015
## 2 Baiomys 1 0.001
## 3 Chaetodipus 184 0.184
## 4 Dipodomys 465 0.465
## 5 Neotoma 44 0.044
## 6 Onychomys 82 0.082
## 7 Perognathus 51 0.051
## 8 Peromyscus 65 0.065
## 9 Reithrodontomys 82 0.082
## 10 Sigmodon 8 0.008
## 11 Spermophilus 3 0.003
With a sample size of 1000, these values should be pretty similar.
But there are reasons why you might choose replace = TRUE
.
For example, if you are projecting probabilities of capturing genera
in the future, then treating past observations as a way to
build generic distributions (using replace = TRUE
) may be
more appropriate.
P(Neotoma) = 0.0366 = 3.6%
These are mutually exclusive events. You cannot be both a Neotoma and a Dipodomys.
P(Neotoma OR Dipodomys) = 0.0366 + 0.4721 = 0.5087
This is called the complement of Onychomys
P(not Onychomys) = 1 - 0.0954 = 0.9046
What is the probability of capturing a female rodent?
What is the probability of capturing a female Dipodomys?
P(female AND Dipodomys)
Note - these are non-independent events. You need to use the General Addition Rule.
\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \]
Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. - Vu and Harrington 2020, p. 101
From Vu and Harrington 2020, p. 103
If A and B represent events from two different and independent processes, then the probability that both A and B occur is given by:
\[ P(A \text{ and } B) = P(A) \times P(B) \]
Similarly, if there are \(k\) events \(A_1\), …, \(A_k\) from \(k\) independent processes, then the probability they all occur is:
\[ P(A_1)P(A_2)...P(A_k) \]
First recall how we calculated the frequencies / probabilities of capturing each of the different genera of rodents. Then use the multiplication rule to calculate the two probabilities above.
genus_frequencies <-
surveys %>%
group_by(genus) %>%
count()
genus_frequencies$freq <- round(genus_frequencies$n / sum(genus_frequencies$n), 4)
knitr::kable(genus_frequencies)
genus | n | freq |
---|---|---|
Ammospermophilus | 437 | 0.0128 |
Baiomys | 46 | 0.0013 |
Chaetodipus | 6029 | 0.1760 |
Dipodomys | 16167 | 0.4721 |
Neotoma | 1252 | 0.0366 |
Onychomys | 3267 | 0.0954 |
Perognathus | 1629 | 0.0476 |
Peromyscus | 2234 | 0.0652 |
Reithrodontomys | 2694 | 0.0787 |
Rodent | 10 | 0.0003 |
Sigmodon | 233 | 0.0068 |
Spermophilus | 249 | 0.0073 |
Consider height in the US population.
What is the probability that a randomly selected individual in the population is taller than 6 feet, 4 inches?
The conditional probability of an event \(A\), given a second event \(B\), is the probability of \(A\) happening, knowing that the event \(B\) has happened.
What is the probability of trapping a Dipodomys merriami?
Let \(A\) = capture a Dipodomys merriami
species_frequencies <-
surveys %>%
group_by(genus, species) %>%
count()
species_frequencies <-
species_frequencies %>%
mutate(freq = round(n / sum(species_frequencies$n), 4))
What is the probability of trapping a Dipodomys merriami, conditional (or given) that you captures a Dipodomys?
Again, let \(A\) = capture a Dipodomys merriami, and now \(B\) = capture a Dipodomys.
dipo_frequencies <-
surveys %>%
filter(genus == "Dipodomys") %>%
group_by(species) %>%
count()
dipo_frequencies <-
dipo_frequencies %>%
mutate(freq = round(n / sum(dipo_frequencies$n), 4))
dipo_frequencies
## # A tibble: 4 × 3
## # Groups: species [4]
## species n freq
## <chr> <int> <dbl>
## 1 merriami 10596 0.655
## 2 ordii 3027 0.187
## 3 sp. 40 0.0025
## 4 spectabilis 2504 0.155
As long as \(P(B) > 0\), \[P(A|B) = \dfrac{P(A \cap B)}{P(B)}. \]
From the definition, \[\begin{align*} P(A|B) =& \dfrac{P(\text{capture a Dipodomys merriami AND capture a Dipodomys})}{P(\text{capture a Dipodomys})} \\ =& \dfrac{0.3094}{0.4721} = 0.6554 \end{align*}\]
A consequence of the definition of conditional probability:
If \(A\) and \(B\) represent two outcomes or events, then \[P(A \cap B) = P(A|B)P(B).\]
This follows from rearranging the definition of conditional probability: \[P(A|B) = \frac{P(A \cap B)}{P(B)} \rightarrow P(A|B)P(B) = P(A \cap B)\]
Unlike the previously mentioned multiplication rule, this is valid for events that might not be independent.
Bayes’ Thorem (simplest form):
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
Follows directly from the definition of conditional probability by noting that \(P(A) P(B|A)\) equals \(P(A \text{ and } B)\):
\[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\]
Foundation of Bayesian statistics:
\[P(\text{assumptions}|\text{data}) = \frac{P(\text{data}|\text{assumptions})P(\text{assumptions})}{P(\text{data})}\]