Introduction to Probability

Section Goals
Meeting preliminaries
Probability
- Frequentist statistics
- Bayesian statistics
Properties of probability

Section Goals

Introduce some basic principles of probability, including conecpts of independence and conditional probability
Learn to simulate some basic probabilistic events
Apply R data management skills to calculate frequencies and probabilities of events using data sets

Meeting preliminaries

We’ll be working with the Portal Survey data this week. We must first load these data into our work environment. Also, we’ll be focusing only on the Rodent taxa, so let’s filter the data to include only rodents

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

surveys <- read_csv("../data/portal_data_joined.csv")

## Rows: 34786 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

surveys <-
  surveys %>%
  filter(taxa == "Rodent")

Probability

We usually talk about probabilities in terms of events; the probability of event A occurring is written P(A). Probabilities can be between zero and one; if P(A) equals zero, then the event is impossible; if P(A) equals one, then the event is certain. - Quinn and Keough 2002

Frequentist statistics

The relative frequency of an event over the long term, after many trials.

E.g., throwing a coin. What’s the probability of a ‘heads’? After one trial, two trials, …, 1000 trials.

Here’s an example of throwing a single coin.

(head <- runif(1) > 0.5)

## [1] TRUE

Bayesian statistics

For now, all I will say is that there are other perspectives in statistics. A key difference between Frequentist and Bayesian statistics is that the latter allows for you to incorporate prior knowledge.

Flipping a coin - frequentist vs bayesian perspective

You find a coin on the ground, and want to know the probability that you will get a heads if you flip the coin. A frequentist approach would flip the coin many times and assume the probability of getting a heads is approximately the frequency of heads. A Bayesian approach would assume that the probability is 0.5 based on a naive guess, then collect data (i.e., flip the coin) and update the prior probability using the newly acquired data.

Properties of probability

a random process gives rise to an outcome
- a coin-flip is a random process
- capturing a Peromyscus in the Portal traps
0 \(\le\) P(some event) \(\le\) 1
\(\hat{p}_n\) is the proportion of outcomes out of the total number of trials run
- If we flip a coin 100 times, and 52 times we got heads, \(\hat{p}_{100} = \frac{52}{100}\)
- Out of 34247 total rodent captures, 2234 were Peromyscus, thus the probability of capturing a Peromyscus is 0.07

Law of Large Numbers

The tendency of \(\hat{p}_n\) to stabilize around \(p\) [as \(n\) increases] is described by the Law of Large Numbers. - Vu and Harrington 2020, p. 92

Example of law of large numbers

Let’s start by calculating the frequency of each of the genera captures in the Portal traps. These frequency values can be interpreted as probability values of catching a particular genus in a trap.

genus_frequencies <-
  surveys %>%
  group_by(genus) %>%
  count()

genus_frequencies$freq <- round(genus_frequencies$n / sum(genus_frequencies$n), 4)

knitr::kable(genus_frequencies)

genus	n	freq
Ammospermophilus	437	0.0128
Baiomys	46	0.0013
Chaetodipus	6029	0.1760
Dipodomys	16167	0.4721
Neotoma	1252	0.0366
Onychomys	3267	0.0954
Perognathus	1629	0.0476
Peromyscus	2234	0.0652
Reithrodontomys	2694	0.0787
Rodent	10	0.0003
Sigmodon	233	0.0068
Spermophilus	249	0.0073

The probabilities above are based on full information. What happens if we only has partial information?

First, how can we get a subset of the data?

# Pick the number of rows to select
rows_to_keep <- sample(1:nrow(surveys), size = 100, replace = FALSE)

surveys_subset <- surveys[rows_to_keep, ]

Next let’s calculate the frequencies / probabilities with the subset of data.

gen_freq_sub <-
  surveys_subset %>%
  group_by(genus) %>%
  count()

gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)

gen_freq_sub

## # A tibble: 8 × 3
## # Groups:   genus [8]
##   genus                n  freq
##   <chr>            <int> <dbl>
## 1 Ammospermophilus     4  0.04
## 2 Chaetodipus         16  0.16
## 3 Dipodomys           50  0.5 
## 4 Neotoma              1  0.01
## 5 Onychomys           12  0.12
## 6 Perognathus          5  0.05
## 7 Peromyscus           3  0.03
## 8 Reithrodontomys      9  0.09

Next, let’s play with the numbers of samples, and see what happens to the frequencies compared to the real frequencies.

# Pick the number of rows to select
rows_to_keep <- sample(1:nrow(surveys), size = 1000, replace = FALSE)

surveys_subset <- surveys[rows_to_keep, ]

gen_freq_sub <-
  surveys_subset %>%
  group_by(genus) %>%
  count()

gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)

gen_freq_sub

## # A tibble: 11 × 3
## # Groups:   genus [11]
##    genus                n  freq
##    <chr>            <int> <dbl>
##  1 Ammospermophilus     8 0.008
##  2 Baiomys              4 0.004
##  3 Chaetodipus        152 0.152
##  4 Dipodomys          483 0.483
##  5 Neotoma             30 0.03 
##  6 Onychomys           95 0.095
##  7 Perognathus         43 0.043
##  8 Peromyscus          78 0.078
##  9 Reithrodontomys     97 0.097
## 10 Sigmodon             6 0.006
## 11 Spermophilus         4 0.004

Aside: sampling with vs. without replacement

Above, we set replace = FALSE in the sample function. This means that any row that is selected, can only be selected once.

On the other hand, if we set replace = TRUE, then any row can be selected more than once by random.

# Pick the number of rows to select - WITH REPLACE = TRUE!
rows_to_keep <- sample(1:nrow(surveys), size = 1000, replace = TRUE)

surveys_subset <- surveys[rows_to_keep, ]

gen_freq_sub <-
  surveys_subset %>%
  group_by(genus) %>%
  count()

gen_freq_sub$freq <- round(gen_freq_sub$n / sum(gen_freq_sub$n), 4)

gen_freq_sub

## # A tibble: 11 × 3
## # Groups:   genus [11]
##    genus                n  freq
##    <chr>            <int> <dbl>
##  1 Ammospermophilus    15 0.015
##  2 Baiomys              1 0.001
##  3 Chaetodipus        184 0.184
##  4 Dipodomys          465 0.465
##  5 Neotoma             44 0.044
##  6 Onychomys           82 0.082
##  7 Perognathus         51 0.051
##  8 Peromyscus          65 0.065
##  9 Reithrodontomys     82 0.082
## 10 Sigmodon             8 0.008
## 11 Spermophilus         3 0.003

With a sample size of 1000, these values should be pretty similar. But there are reasons why you might choose replace = TRUE. For example, if you are projecting probabilities of capturing genera in the future, then treating past observations as a way to build generic distributions (using replace = TRUE) may be more appropriate.

Specific examples of calculating probabilities

Probability of capturing a Neotoma

P(Neotoma) = 0.0366 = 3.6%

Probability of capturing a Neotoma OR a Dipodomys

These are mutually exclusive events. You cannot be both a Neotoma and a Dipodomys.

P(Neotoma OR Dipodomys) = 0.0366 + 0.4721 = 0.5087

Probability of capturing something other than Onychomys

This is called the complement of Onychomys

P(not Onychomys) = 1 - 0.0954 = 0.9046

Challenge - more with probability

What is the probability of capturing a female rodent?
What is the probability of capturing a female Dipodomys?

P(female AND Dipodomys)

What is the probability of capturing a female rodent OR a Dipodomys?

Note - these are non-independent events. You need to use the General Addition Rule.

\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \]

Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. - Vu and Harrington 2020, p. 101

Challenge

Is flipping two coins different from flipping one, then the other?
You flip 99 heads. What is the probability you will flip a head on the 100th toss?
You catch a Neotoma in a trap in one of the Portal plots. What is the probability that you will catch a Neotoma in a trap in one of the other Portal plots?

Multiplication Rule for Independent Processes

From Vu and Harrington 2020, p. 103

If A and B represent events from two different and independent processes, then the probability that both A and B occur is given by:

\[ P(A \text{ and } B) = P(A) \times P(B) \]

Similarly, if there are \(k\) events \(A_1\), …, \(A_k\) from \(k\) independent processes, then the probability they all occur is:

\[ P(A_1)P(A_2)...P(A_k) \]

Challenge - probability of capturing 100 Dipodomys in a row vs capturing 100 Chaetodipus in a row.

First recall how we calculated the frequencies / probabilities of capturing each of the different genera of rodents. Then use the multiplication rule to calculate the two probabilities above.

genus_frequencies <-
  surveys %>%
  group_by(genus) %>%
  count()

genus_frequencies$freq <- round(genus_frequencies$n / sum(genus_frequencies$n), 4)

knitr::kable(genus_frequencies)

genus	n	freq
Ammospermophilus	437	0.0128
Baiomys	46	0.0013
Chaetodipus	6029	0.1760
Dipodomys	16167	0.4721
Neotoma	1252	0.0366
Onychomys	3267	0.0954
Perognathus	1629	0.0476
Peromyscus	2234	0.0652
Reithrodontomys	2694	0.0787
Rodent	10	0.0003
Sigmodon	233	0.0068
Spermophilus	249	0.0073

Conditional Probability

Conditional Probability: Intuition

Consider height in the US population.

What is the probability that a randomly selected individual in the population is taller than 6 feet, 4 inches?

Suppose you learn that the individual is a professional basketball player. Does this change the probability that the individual is taller than 6 feet, 4 inches?

Conditional Probability: Concept

The conditional probability of an event \(A\), given a second event \(B\), is the probability of \(A\) happening, knowing that the event \(B\) has happened.

This conditional probability is denoted \(P(A|B)\)

What is the probability of trapping a Dipodomys merriami?

Let \(A\) = capture a Dipodomys merriami

species_frequencies <-
  surveys %>%
  group_by(genus, species) %>%
  count()

species_frequencies <-
  species_frequencies %>%
  mutate(freq = round(n / sum(species_frequencies$n), 4))

What is the probability of trapping a Dipodomys merriami, conditional (or given) that you captures a Dipodomys?

Again, let \(A\) = capture a Dipodomys merriami, and now \(B\) = capture a Dipodomys.

dipo_frequencies <-
  surveys %>%
  filter(genus == "Dipodomys") %>%
  group_by(species) %>%
  count()

dipo_frequencies <-
  dipo_frequencies %>%
  mutate(freq = round(n / sum(dipo_frequencies$n), 4))

dipo_frequencies

## # A tibble: 4 × 3
## # Groups:   species [4]
##   species         n   freq
##   <chr>       <int>  <dbl>
## 1 merriami    10596 0.655 
## 2 ordii        3027 0.187 
## 3 sp.            40 0.0025
## 4 spectabilis  2504 0.155

Conditional Probability: Formal Definition

As long as \(P(B) > 0\), \[P(A|B) = \dfrac{P(A \cap B)}{P(B)}. \]

From the definition, \[\begin{align*} P(A|B) =& \dfrac{P(\text{capture a Dipodomys merriami AND capture a Dipodomys})}{P(\text{capture a Dipodomys})} \\ =& \dfrac{0.3094}{0.4721} = 0.6554 \end{align*}\]

Independence, Again…

A consequence of the definition of conditional probability:

If \(P(A|B) = P(A)\), then \(A\) and \(B\) are independent; knowing \(B\) offers no information about whether \(A\) occurred.

General Multiplication Rule

If \(A\) and \(B\) represent two outcomes or events, then \[P(A \cap B) = P(A|B)P(B).\]

This follows from rearranging the definition of conditional probability: \[P(A|B) = \frac{P(A \cap B)}{P(B)} \rightarrow P(A|B)P(B) = P(A \cap B)\]

Unlike the previously mentioned multiplication rule, this is valid for events that might not be independent.

Bayes’ Theorem, aka Bayes’ Rule

Bayes’ Thorem (simplest form):

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

Follows directly from the definition of conditional probability by noting that \(P(A) P(B|A)\) equals \(P(A \text{ and } B)\):

\[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\]

Foundation of Bayesian statistics:

\[P(\text{assumptions}|\text{data}) = \frac{P(\text{data}|\text{assumptions})P(\text{assumptions})}{P(\text{data})}\]