One of the biggest advantages to R is that it has a rich ecosystem of add-ons, also known as packages. These packages are written my members of the R community (for the most part), and freely available on CRAN. As a class, let’s navigate to CRAN and have a look at the packages section of the site.
Now, to install and load a package.
install.packages("VennDiagram")
Then to load it into my workspace, and make the functions available:
library(VennDiagram)
## Warning: package 'VennDiagram' was built under R version 3.4.3
## Loading required package: grid
## Loading required package: futile.logger
Let’s draw a Venn diagram representing the probability of getting a blue M&M. This is written algebraically as \(P(Blue)\):
## Load VennDiagram package - NB: you will likely have to install this package first
library(VennDiagram)
## Create a Venn diagram
draw.single.venn(area = 0.24, category = "Blue M&M", fill = "blue", alpha = 0.5)
Draw a Venn diagram representing the probability of getting a blue or red M&M. This is written algebraically as \(P(Blue \cup Red)\):
grid.newpage()
draw.pairwise.venn(area1 = 0.24, area2 = 0.13, cross.area = 0,
category = c("Blue M&M", "Red M&M"),
fill = c("blue", "red"), alpha = rep(0.5, 2))
IMPORTANT: Note that there is no overlap between these two events. We say that these events are mutually exclusive.
Let’s imagine the scenario that we add an equal amount of Peanut and Almond M&Ms to our bags of M&Ms. Now we can ask a question like, what is the probability of getting a blue AND peanut M&M? We can write this algebraically as \(P(Blue \cap Peanut)\).
Recall that \(P(Blue) = 0.24\), and because all three types of M&Ms are represented equally, the \(P(Peanut) = 0.33\).
grid.newpage()
draw.pairwise.venn(area1 = 0.24, area2 = 0.33, cross.area = 0.0792,
category = c("Blue M&M", "Peanut M&M"),
fill = c("blue", "yellow"), alpha = rep(0.5, 2))
IMPORTANT: Note that in this example, the color and type of M&M are indpendent from each other.
What is the probability of getting a blue OR peanut M&M? I.e. \(P(Blue \cup Peanut)\).
Hint: Note the difference between adding up the area represented in the Venn diagram versus adding the \(P(Blue) + P(Peanut)\).
What is the probability of drawing a blue M&M, CONDITIONAL on it being a peanut M&M?
\[ P(Blue|Peanut) = \frac{P(Blue \cap Peanut)}{P(Peanut)} \]
Calculate this probability.1
We can use a bit of algebra and re-write a general form of the equation above as:
\[ P(A|B) P(B) = P(A \cap B) \]
Now, because \(P(A \cap B)\) is identical to \(P(B \cap A)\), we can expand this further too:
\[ P(A|B) P(B) = P(A \cap B) = P(B|A) P(A) \]
And after a little more algebra, we arrive at Bayes Theorem:
\[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{P(B|A) P(A)}{P(B|A)P(A) + P(B|~A)P(~A)} \]
The first step to any data analysis is to interrogate your data by calculating some standard statistics and by visualizing it it various ways. In future classes we will look more closely at standard statistics. Today, we’ll focus on data visualization.
We’re going to work with a dataset that comes built in to R, commonly called the iris dataset. It is also sometimes called Fisher’s Iris dataset (but should more appropriately be called Anderson’s Iris dataset). Because it comes pre-packaged with R, we can load it into our environment using the data
function.
data( iris )
Let’s have a look at this dataset
head( iris )
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail( iris )
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
The dataset contains measurements of four characteristics of three different species of Iris (plants!).
summary
functionLet’s begin by using the summary
function to examine this dataset. summary
returns many of the standard statistics. When doing data exploration, a few things you want to look at are:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Factors ‘appear’ to be similar to characters, but are in fact coded numerically in R. Think of factors like categories. Here’s a quick example that demonstrates the difference in these two variable types that shows up when using summary
.
## Make a new iris dataset
iris_new <- iris
## Create a new column that treats species as a character, rather than a factor
iris_new$species_char <- as.character(iris_new$Species)
## Run summary command
summary(iris_new)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species species_char
## setosa :50 Length:150
## versicolor:50 Class :character
## virginica :50 Mode :character
##
##
##
What am I going to get if I execute the command below?
head(iris[c("Sepal.Width","Petal.Length")])
base
R functionsLet’s add a habitat type variable to the iris dataset. We’ll use this later. Caveat - I made up these habitat type preferences.
iris_habitat <- data.frame( Species = c( "setosa", "versicolor", "virginica" ),
Habitat = c( "forest", "wetland", "meadow" ) )
iris_full <- merge( x = iris, y = iris_habitat, by = "Species" )
head( iris_full )
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 1 setosa 5.1 3.5 1.4 0.2 forest
## 2 setosa 4.9 3.0 1.4 0.2 forest
## 3 setosa 4.7 3.2 1.3 0.2 forest
## 4 setosa 4.6 3.1 1.5 0.2 forest
## 5 setosa 5.0 3.6 1.4 0.2 forest
## 6 setosa 5.4 3.9 1.7 0.4 forest
tail( iris_full )
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 145 virginica 6.7 3.3 5.7 2.5 meadow
## 146 virginica 6.7 3.0 5.2 2.3 meadow
## 147 virginica 6.3 2.5 5.0 1.9 meadow
## 148 virginica 6.5 3.0 5.2 2.0 meadow
## 149 virginica 6.2 3.4 5.4 2.3 meadow
## 150 virginica 5.9 3.0 5.1 1.8 meadow
Perhaps the most common way to look at data for a single variable is a histogram. This essentially is a bar plot, where each bar represents the number of times a value falls within a particular bin.
Example: let’s look at the distribution of Petal.Length
values
hist(iris_full$Petal.Length)
?hist
to determine how to change the number of bins usedHint: bins are sometimes called breaks
?hist
hist(iris_full$Petal.Length, breaks = 30)
We would like to plot Sepal.Length versus Petal.Length. We’ll first do this using base R functions.
plot( x = iris_full$Sepal.Length, y = iris_full$Petal.Length )
Note, we could save a little typing by using the with
function. This sets up a temporary environment where all of the variables (columns) in my dataset are defined individually.
with( data = iris_full, plot( x = Sepal.Length, y = Petal.Length ) )
ggplot2
packageNow we’re going to introduce a data visualization package called ggplot2. This package is great for producing publication quality graphics, but the syntax used to create a plot is a little more involved.
First, let’s install the ggplot2
package:
# Only need to do this once
install.packages("ggplot2")
Then load it:
library(ggplot2)
OK, let’s fire up ggplot.
ggplot() + # First make an empty ggplot canvas. Notice the trailing plus sign
geom_point( data = iris_full, aes( x = Sepal.Length, y = Petal.Length ) )
Let’s break down that call to introduce a few key things about ggplot
THAT SEEMS SO COMPLICATED!
It’s true, for simple plots, ggplot can be much more complicated than simply using the base functions. But the power of ggplot lies in the ability to lay several geometries (geoms) over each other. Also, each geometry has a rich set of options. For example, let’s say I want to create the plot we just made, but have each species represented by a different color.
ggplot() + # First make an empty ggplot canvas. Notice the trailing plus sign
geom_point( data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species ) )
Let’s add more information - how about habitat type as well.
ggplot() + # First make an empty ggplot canvas. Notice the trailing plus sign
geom_point( data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species, shape = Habitat), size = 2.5 ) +
theme_bw()
facets - a way to separate data into different subplots
Let’s say we wanted different plots for each species. We can do that in ggplot using facets
.
ggplot( data = iris_full, aes( x = Sepal.Length, y = Petal.Length ) ) +
geom_point() +
facet_grid( Species ~ . )
NOTE: I moved the data ...
stuff into the initial ggplot
call here.
ggplot2
has many geometries, allowing us to make lot’s of different types of plots. Let’s make two new plots - one boxplot of Petal.Length
, with one boxplot for each species. Use geom_boxplot
for this.ggplot(data = iris_full, aes(x = Species, y = Petal.Length)) +
geom_boxplot(aes(fill = Species)) +
theme_bw()
ggplot() +
geom_histogram(data = iris_full, aes(x = Petal.Length)) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use facets to separate the three species.
ggplot() +
geom_histogram(data = iris_full, aes(x = Petal.Length, fill = Species)) +
facet_grid( Species ~ . ) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
geom_density
and colour
to make different colored density lines for iris in each habitat type. Note: the area under the curve is equal to 1.ggplot() +
geom_density(data = iris_full, aes(x = Petal.Length, colour = Species)) +
theme_bw()
ggplot() +
geom_histogram(data = iris_full,
aes(x = Petal.Length, y = ..density.., fill = Species)) +
facet_grid( Species ~ . ) +
geom_density(data = iris_full, aes(x = Petal.Length, colour = Species)) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot2
resourcesThe value should be 0.24. Why?↩