Review of Lecture 4

Visualizing probability - Venn Diagrams

Aside: getting and loading R packages

One of the biggest advantages to R is that it has a rich ecosystem of add-ons, also known as packages. These packages are written my members of the R community (for the most part), and freely available on CRAN. As a class, let’s navigate to CRAN and have a look at the packages section of the site.

Now, to install and load a package.

install.packages("VennDiagram")

Then to load it into my workspace, and make the functions available:

library(VennDiagram)
## Warning: package 'VennDiagram' was built under R version 3.4.3
## Loading required package: grid
## Loading required package: futile.logger

Let’s draw a Venn diagram representing the probability of getting a blue M&M. This is written algebraically as \(P(Blue)\):

## Load VennDiagram package - NB: you will likely have to install this package first
library(VennDiagram)

## Create a Venn diagram
draw.single.venn(area = 0.24, category = "Blue M&M", fill = "blue", alpha = 0.5)

Draw a Venn diagram representing the probability of getting a blue or red M&M. This is written algebraically as \(P(Blue \cup Red)\):

grid.newpage()
draw.pairwise.venn(area1 = 0.24, area2 = 0.13, cross.area = 0, 
                   category = c("Blue M&M", "Red M&M"),
                   fill = c("blue", "red"), alpha = rep(0.5, 2))

IMPORTANT: Note that there is no overlap between these two events. We say that these events are mutually exclusive.

Adding Peanut and Almond M&Ms to the mix

Let’s imagine the scenario that we add an equal amount of Peanut and Almond M&Ms to our bags of M&Ms. Now we can ask a question like, what is the probability of getting a blue AND peanut M&M? We can write this algebraically as \(P(Blue \cap Peanut)\).

Recall that \(P(Blue) = 0.24\), and because all three types of M&Ms are represented equally, the \(P(Peanut) = 0.33\).

grid.newpage()
draw.pairwise.venn(area1 = 0.24, area2 = 0.33, cross.area = 0.0792, 
                   category = c("Blue M&M", "Peanut M&M"),
                   fill = c("blue", "yellow"), alpha = rep(0.5, 2))

IMPORTANT: Note that in this example, the color and type of M&M are indpendent from each other.

Challange

What is the probability of getting a blue OR peanut M&M? I.e. \(P(Blue \cup Peanut)\).

Hint: Note the difference between adding up the area represented in the Venn diagram versus adding the \(P(Blue) + P(Peanut)\).

Conditional probability


Challange

What is the probability of drawing a blue M&M, CONDITIONAL on it being a peanut M&M?

\[ P(Blue|Peanut) = \frac{P(Blue \cap Peanut)}{P(Peanut)} \]

Calculate this probability.1


More general thinking

We can use a bit of algebra and re-write a general form of the equation above as:

\[ P(A|B) P(B) = P(A \cap B) \]

Now, because \(P(A \cap B)\) is identical to \(P(B \cap A)\), we can expand this further too:

\[ P(A|B) P(B) = P(A \cap B) = P(B|A) P(A) \]

And after a little more algebra, we arrive at Bayes Theorem:

\[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{P(B|A) P(A)}{P(B|A)P(A) + P(B|~A)P(~A)} \]


Data Visualization and Exploration

The first step to any data analysis is to interrogate your data by calculating some standard statistics and by visualizing it it various ways. In future classes we will look more closely at standard statistics. Today, we’ll focus on data visualization.

Iris dataset

We’re going to work with a dataset that comes built in to R, commonly called the iris dataset. It is also sometimes called Fisher’s Iris dataset (but should more appropriately be called Anderson’s Iris dataset). Because it comes pre-packaged with R, we can load it into our environment using the data function.

data( iris )

Let’s have a look at this dataset

head( iris )
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail( iris )
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

The dataset contains measurements of four characteristics of three different species of Iris (plants!).

summary function

Let’s begin by using the summary function to examine this dataset. summary returns many of the standard statistics. When doing data exploration, a few things you want to look at are:

  • How to the mean and median values within a variable compare?
  • Do the min and max values suggest there are outliers?
  • Which variables (i.e., columns) are quantitative (numeric) versus categorical (factors or characters)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Aside: Differences between characters and factors

Factors ‘appear’ to be similar to characters, but are in fact coded numerically in R. Think of factors like categories. Here’s a quick example that demonstrates the difference in these two variable types that shows up when using summary.

## Make a new iris dataset
iris_new <- iris

## Create a new column that treats species as a character, rather than a factor
iris_new$species_char <- as.character(iris_new$Species)

## Run summary command
summary(iris_new)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   species_char      
##  setosa    :50   Length:150        
##  versicolor:50   Class :character  
##  virginica :50   Mode  :character  
##                                    
##                                    
## 

Aside: A (very) brief introduction to navigating a data.frame

We will be very brief here. I recommend checking out this Data Carpentry lesson for more information.

  • Looking at specific data.frame elements. Use the row and column notation.

Here is the 5th row, 3rd column (Petal.Length). Note: We are using square brackets to index the data.frame and we always use row, column notation.

iris[5, 3]
## [1] 1.4
  • Looking at an entire column.

Here are two ways to get the Petal.Length column.

First, note: we leave the row part blank, but still add the comma.

iris[ ,3]
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1

Second, use only the variable (column) name. Note the use of the $ operator

iris$Petal.Length
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
  • Looking at specific column entry

This is another way to look at the 5th entry in the Petal.Length column.

iris$Petal.Length[5]
## [1] 1.4
  • Looking at all entries for a given row.

Here’s all the entries for the 5th row. Note: here we leave the column part blank, but still add the comma.

iris[5, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5            5         3.6          1.4         0.2  setosa
  • Looking at a set of rows and/or columns.

Here’s all the entries in the 5th through 10th rows, 1st through 3rd columns. Note: we use the : operator to look at a range of value.

iris[5:10, 1:3]
##    Sepal.Length Sepal.Width Petal.Length
## 5           5.0         3.6          1.4
## 6           5.4         3.9          1.7
## 7           4.6         3.4          1.4
## 8           5.0         3.4          1.5
## 9           4.4         2.9          1.4
## 10          4.9         3.1          1.5
  • For data.frames, if you do not use row, column notation, you will get only the columns back.
head(iris[2:3])
##   Sepal.Width Petal.Length
## 1         3.5          1.4
## 2         3.0          1.4
## 3         3.2          1.3
## 4         3.1          1.5
## 5         3.6          1.4
## 6         3.9          1.7

Challange

What am I going to get if I execute the command below?

head(iris[c("Sepal.Width","Petal.Length")])

Visualization using base R functions

Let’s add a habitat type variable to the iris dataset. We’ll use this later. Caveat - I made up these habitat type preferences.

iris_habitat <- data.frame( Species = c( "setosa", "versicolor", "virginica" ),
                            Habitat = c( "forest", "wetland", "meadow" ) )

iris_full <- merge( x = iris, y = iris_habitat, by = "Species" )

head( iris_full )
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 1  setosa          5.1         3.5          1.4         0.2  forest
## 2  setosa          4.9         3.0          1.4         0.2  forest
## 3  setosa          4.7         3.2          1.3         0.2  forest
## 4  setosa          4.6         3.1          1.5         0.2  forest
## 5  setosa          5.0         3.6          1.4         0.2  forest
## 6  setosa          5.4         3.9          1.7         0.4  forest
tail( iris_full )
##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 145 virginica          6.7         3.3          5.7         2.5  meadow
## 146 virginica          6.7         3.0          5.2         2.3  meadow
## 147 virginica          6.3         2.5          5.0         1.9  meadow
## 148 virginica          6.5         3.0          5.2         2.0  meadow
## 149 virginica          6.2         3.4          5.4         2.3  meadow
## 150 virginica          5.9         3.0          5.1         1.8  meadow

Visualizing the measurements of a single variable

Perhaps the most common way to look at data for a single variable is a histogram. This essentially is a bar plot, where each bar represents the number of times a value falls within a particular bin.

Example: let’s look at the distribution of Petal.Length values

hist(iris_full$Petal.Length)


Challange - use ?hist to determine how to change the number of bins used

Hint: bins are sometimes called breaks

?hist
hist(iris_full$Petal.Length, breaks = 30)


Visualizing relationships between two variables

We would like to plot Sepal.Length versus Petal.Length. We’ll first do this using base R functions.

plot( x = iris_full$Sepal.Length, y = iris_full$Petal.Length )

Note, we could save a little typing by using the with function. This sets up a temporary environment where all of the variables (columns) in my dataset are defined individually.

with( data = iris_full, plot( x = Sepal.Length, y = Petal.Length ) )

Visualization using the ggplot2 package

Now we’re going to introduce a data visualization package called ggplot2. This package is great for producing publication quality graphics, but the syntax used to create a plot is a little more involved.

Aside: Installing and loading packages

First, let’s install the ggplot2 package:

# Only need to do this once
install.packages("ggplot2")

Then load it:

library(ggplot2)

OK, let’s fire up ggplot.

ggplot() + # First make an empty ggplot canvas. Notice the trailing plus sign
  geom_point( data = iris_full, aes( x = Sepal.Length, y = Petal.Length ) )

Let’s break down that call to introduce a few key things about ggplot

  • ggplot: the initial canvas we’re working on
  • geom: geometric objects (i.e. the type of plot - histogram, points, line, etc)
  • aes: aesthetic mapping

THAT SEEMS SO COMPLICATED!

It’s true, for simple plots, ggplot can be much more complicated than simply using the base functions. But the power of ggplot lies in the ability to lay several geometries (geoms) over each other. Also, each geometry has a rich set of options. For example, let’s say I want to create the plot we just made, but have each species represented by a different color.

ggplot() + # First make an empty ggplot canvas. Notice the trailing plus sign
  geom_point( data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species ) )

Let’s add more information - how about habitat type as well.

ggplot() + # First make an empty ggplot canvas. Notice the trailing plus sign
  geom_point( data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species, shape = Habitat), size = 2.5 ) +
  theme_bw()

facets - a way to separate data into different subplots

Let’s say we wanted different plots for each species. We can do that in ggplot using facets.

ggplot( data = iris_full, aes( x = Sepal.Length, y = Petal.Length ) ) + 
  geom_point() +
  facet_grid( Species ~ . )

NOTE: I moved the data ... stuff into the initial ggplot call here.


Challange

  1. ggplot2 has many geometries, allowing us to make lot’s of different types of plots. Let’s make two new plots - one boxplot of Petal.Length, with one boxplot for each species. Use geom_boxplot for this.
ggplot(data = iris_full, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(aes(fill = Species)) +
  theme_bw()  

  1. Make a histogram of Petal.width.
ggplot() +
  geom_histogram(data = iris_full, aes(x = Petal.Length)) +
  theme_bw()  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Use facets to separate the three species.

ggplot() +
  geom_histogram(data = iris_full, aes(x = Petal.Length, fill = Species)) +
  facet_grid( Species ~ . ) +
  theme_bw()  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Second, let’s make density plots of Petal.Width. Use geom_density and colour to make different colored density lines for iris in each habitat type. Note: the area under the curve is equal to 1.
ggplot() +
  geom_density(data = iris_full, aes(x = Petal.Length, colour = Species)) +
  theme_bw()

  1. Use histogram to plot density instead of counts.
ggplot() +
  geom_histogram(data = iris_full, 
                 aes(x = Petal.Length, y = ..density.., fill = Species)) +
  facet_grid( Species ~ . ) +
  geom_density(data = iris_full, aes(x = Petal.Length, colour = Species)) +
  theme_bw()  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


More ggplot2 resources



  1. The value should be 0.24. Why?