We’ll be working with the Portal Survey data this week (again). We must first load these data into our work environment. Also, we’ll be focusing only on the Rodent taxa, so let’s filter the data to include only rodents
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
surveys <- read_csv("../data/portal_data_joined.csv")
## Rows: 34786 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <-
surveys %>%
filter(taxa == "Rodent")
A random variable is a function that maps each event in a sample space to a number.
Suppose \(X\) is the number of heads in 3 tosses of a fair coin.
The distribution of a discrete random variable is the collection of its values and the probabilities associated with those values.
The probability distribution for \(X\) is as follows:
\(x_i\) | 0 | 1 | 2 | 3 |
\(P(X = x_i)\) | 1/8 | 3/8 | 3/8 | 1/8 |
\[\sum_{x=0}^3 P(X = x_i) = 1 \]
We can look at a density plot of the frequency of the different genera, which would be a probability distribution for genera type.
ggplot(data = surveys, aes(x = genus, y = ..prop.., group = 1)) +
geom_bar() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
## Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(prop)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Key idea: We can think of the area for a particular bin as the probability of getting a value that falls into that bin.
There are many defined probability distributions that have specific properties. In the online lecture you learned about the binomial, Poisson, and normal distributions. Some properties of distributions to keep in mind:
The Binomial distribution can be thought of as the sum of \(n\) Bernoulli distributions (see below),
all with the same parameterization (i.e., probability of success = \(p\)).
This is useful if we want to find the probability of getting a certain
number of successes if your repeat some experiment many times.
\[ {pdf} = P(x | N,h) = \binom{N}{x} h^x (1-h)^{N-x} \]
Question: What is the number of successes, \(x\), in \(N\) trials, where the probability of a success is \(h\)
The Bernoulli distribution is a very simple distribution that can be used if we have a single event (or experiment) that has two possible outcomes, governed by some probability. For example, the probability of the outcome of a single coin toss with a fair coin can be described using a Bernoulli distribution.
\({pdf} = P(x) = p\) if \(x=1\) and \((1-p)\) if \(x=0\), where \(x\) is the event outcome (i.e., heads or tails)
Let’s consider the probability of trapping a Dipodomys versus not a Dipodomys.
P(trap a Dipodomys) = 0.4721
Now, let’s look at R’s internal Binomial distribution function.
rbinom(n = 1, size = 100, prob = 0.4721)
## [1] 55
new_data <- rbinom(n = 100, size = 100, prob = 0.4721)
ggplot() +
geom_histogram(data = NULL,
aes(x = new_data, y = ..density.. ),
binwidth = 1) +
theme_bw()
Let’s look at this empirically from our data.
How many Dipodomys trappings do we get if we randomly selected 100 of the trapping events?
# select 100 random rows
rand_rows <- sample(1:nrow(surveys), size = 100)
# grab these rows as a subset of the surveys data set
surveys_subset <- surveys[rand_rows, ]
# count the number of dipodomys
sum(surveys_subset$genus == "Dipodomys")
## [1] 46
This is equivalent to a single draw from a Binomial distribution.
Next, let’s try multiple draws from our data set.
# Create an empty vector of dipo_counts
dipo_counts <- c()
for(x in 1:100){
# select 100 random rows
rand_rows <- sample(1:nrow(surveys), size = 100)
# grab these rows as a subset of the surveys data set
surveys_subset <- surveys[rand_rows, ]
# count the number of dipodomys and add this count to
# our dipo_counts vector
dipo_counts <- c(dipo_counts, sum(surveys_subset$genus == "Dipodomys"))
}
Now, let’s make a histogram of the dipo_counts
data.
ggplot() +
geom_histogram(data = NULL,
aes(x = dipo_counts, y = ..density.. ),
binwidth = 1) +
theme_bw()
Using the code above, experiment with what happens when you change
values for n and / or sample size. In this case,
n was equal to the number of iterations of our for
loop and sample size was equal to the number of random rows
sample
d from the overall data set.
The normal distribution is perhaps the most widely used distribution in life science. It is also probably the most familiar.
The probability density function for the normal is:
\[ f(x|\mu,\sigma) \sim \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Let’s dive into working with the Normal in R.
In R, there are several things you can do with all kinds of
distributions (not just the normal). Start by looking for help on
rnorm
?rnorm
Note that there is dnorm
, pnorm
,
qnorm
, and rnorm
. For now, know that
rnorm
generates random variables from the distribution.
normal_rvs <- rnorm(n = 100, mean = 0, sd = 1)
ggplot() +
geom_histogram( data = NULL, aes(x = normal_rvs) ) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mean
and sd
parameters.The Standard Normal distribution is a Normal distribution with \(\mu = 0\) and \(\sigma = 1\).
Let’s look at the distribution of the weight of captured Chaetodipus.
chaetodipus <-
surveys %>%
filter(genus == "Chaetodipus")
ggplot(data = chaetodipus, aes(x = weight, y = ..density..)) +
geom_histogram() +
geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).
The observed distribution does not appear to be normally distributed. But we can change the scale of the values so they appear more normal-like. The most common transformation for this is a log transformation, either a \(log_e\) or \(log_{10}\) transformation.
chaetodipus$log_weight <- log10(chaetodipus$weight)
ggplot(data = chaetodipus, aes(x = log_weight, y = ..density..)) +
geom_histogram() +
geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).
OK. Much closer. But there does appear to be some bimodality in the distribution. Why? Perhaps this is due to sex differences.
ggplot(data = chaetodipus, aes(x = log_weight, y = ..density.., fill = sex)) +
geom_histogram(position = "dodge") +
geom_density(alpha = 0.3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).
There appears to be some differences between sexes, particularly larger females. But there does appear to be two modes even for each sex.
Let’s push on and compare our observations to an assumed normal
distribution of weights. To parameterize our distribution, we need to
calculate the mean log_weight
and standard deviation of
log_weight
.
mean_chaeto_logweight <- mean(chaetodipus$log_weight, na.rm = TRUE)
sd_chaeto_logweight <- sd(chaetodipus$log_weight, na.rm = TRUE)
ggplot(data = chaetodipus) +
geom_histogram(aes(x = log_weight, y = ..density..)) +
geom_density(aes(x = log_weight)) +
stat_function(fun = dnorm,
args = list(mean = mean_chaeto_logweight, sd = sd_chaeto_logweight),
colour = "blue", alpha = 0.6, size = 2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 185 rows containing non-finite outside the scale range
## (`stat_density()`).
OK, now let’s calculate some frequencies and probabilities, using both the observed data and assuming a normal distribution.
Based on observations - count the number of Chaetodipus trapped that are greater than 1.35 and divide that number by the total number of Chaetodipus trapped.
sum(chaetodipus$log_weight > mean_chaeto_logweight, na.rm = TRUE) / nrow(chaetodipus)
## [1] 0.4572898
Based on assumption of normal distribution - use R’s internal functions.
pnorm(q = mean_chaeto_logweight,
mean = mean_chaeto_logweight,
sd = sd_chaeto_logweight,
lower.tail = FALSE)
## [1] 0.5
There are many questions we may ask that are related to the Poisson distribution. Usually we think of the Poisson when we have some process that usually results in a small number most of the times and produces larger numbers less frequently. Think about the number of eggs produced by some bird, or the number of off spring for some animal. By substituting the usual \(\lambda\) value in the Poisson with \(\lambda T\), where \(T\) is some defined time period, and \(\lambda\) is some rate value, we can use the Poisson to address questions concerning the number of successes in some time period T.
\[ {pdf} = P(x|T,\lambda) = \frac{ e^{-(\lambda T)} (\lambda T)^x }{ x! } \]
Calculate the rates (number of captures per year) for all of the species.
surveys %>%
group_by(genus, species, year) %>%
count() %>%
mutate(genus_species = paste(genus, species)) %>%
group_by(genus_species) %>%
summarise(avg_cnt = mean(n)) %>%
arrange(avg_cnt)
## # A tibble: 29 × 2
## genus_species avg_cnt
## <chr> <dbl>
## 1 Reithrodontomys sp. 1
## 2 Spermophilus tereticaudus 1
## 3 Rodent sp. 1.43
## 4 Chaetodipus sp. 1.5
## 5 Onychomys sp. 2
## 6 Chaetodipus intermedius 2.25
## 7 Dipodomys sp. 2.67
## 8 Perognathus hispidus 3.56
## 9 Reithrodontomys montanus 4
## 10 Sigmodon fulviventer 7.17
## # ℹ 19 more rows
What is the probability that we will capture exactly 5 rock pocket mice in the coming year’s survey?
dpois(x = 5, lambda = 2.25)
## [1] 0.05064875
What is the probability that we will capture 5 or more rock pocket mice?
ppois(q = 4, lambda = 2.25, lower.tail = FALSE)
## [1] 0.07801411