Section Goals

The first step in any data analysis is to interrogate your data by calculating some standard statistics and by visualizing it in various ways. In this section, we’ll focus on data visualization.

Setting up our data

For the next section, we’re going to continue to work with the iris dataset.

Before continuing with visualization, let’s add a habitat type variable to the iris dataset. We’ll use this later. Caveat: I made up these habitat type preferences.

iris_habitat <- data.frame( Species = c( "setosa", "versicolor", "virginica" ),
                            Habitat = c( "forest", "wetland", "meadow" ) )

iris_full <- merge( x = iris, y = iris_habitat, by = "Species" )

head( iris_full )
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 1  setosa          5.1         3.5          1.4         0.2  forest
## 2  setosa          4.9         3.0          1.4         0.2  forest
## 3  setosa          4.7         3.2          1.3         0.2  forest
## 4  setosa          4.6         3.1          1.5         0.2  forest
## 5  setosa          5.0         3.6          1.4         0.2  forest
## 6  setosa          5.4         3.9          1.7         0.4  forest
tail( iris_full )
##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 145 virginica          6.7         3.3          5.7         2.5  meadow
## 146 virginica          6.7         3.0          5.2         2.3  meadow
## 147 virginica          6.3         2.5          5.0         1.9  meadow
## 148 virginica          6.5         3.0          5.2         2.0  meadow
## 149 virginica          6.2         3.4          5.4         2.3  meadow
## 150 virginica          5.9         3.0          5.1         1.8  meadow
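
Because merge() keeps only rows whose keys match in both tables by default, it is worth a quick sanity check after a join like this one. The following is a minimal sketch that rebuilds iris_full so it runs on its own, then confirms that no rows were lost and that each species ended up with exactly one habitat. The checks (and the name habitat_table) are illustrative additions, not part of the original workflow.

```r
# Rebuild the lookup table and the merged data so this block runs on its own
iris_habitat <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                           Habitat = c("forest", "wetland", "meadow"))
iris_full <- merge(x = iris, y = iris_habitat, by = "Species")

# merge() does an inner join by default, so unmatched rows would be dropped;
# the row count should therefore be unchanged
nrow(iris_full) == nrow(iris)

# Cross-tabulate species against habitat: each species should fall
# under exactly one habitat column
habitat_table <- table(iris_full$Species, iris_full$Habitat)
habitat_table
```

If a species name were misspelled in iris_habitat, the row-count check would fail, which is usually the first sign of a bad join.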

Visualization using the ggplot2 package

I’m going to introduce a data visualization package called ggplot2. This package is great for producing publication-quality graphics, but the syntax used to create a plot is a little more involved than in base R (i.e., the graphics package).

Aside: Installing and loading packages

First, we need to install the ggplot2 package:

# Only need to do this once
install.packages("ggplot2")

Then load it:

library(ggplot2)

NOTE: You only have to install a new package once, but you need to call the library function at the beginning of every new R session.
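
If you share a script with other people, one common pattern is to install a package only when it is missing and then load it. A small sketch, assuming a CRAN mirror is reachable from your machine:

```r
# requireNamespace() returns FALSE (rather than erroring) when the
# package is not installed, so we can use it as a guard
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading still has to happen in every new R session
library(ggplot2)
```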

Visualizing the measurements of a single variable

Perhaps the most common way to look at data for a single variable is a histogram. This essentially is a bar plot, where each bar represents the number of times a value falls within a particular bin.
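
To see what "falling within a bin" means, here is a rough sketch that does the binning by hand in base R. The bin edges (width-1 bins from 1 to 7) are chosen only for illustration and are not the bins ggplot2 would pick by default.

```r
# Bin edges chosen for illustration: width-1 bins covering 1 to 7
breaks <- seq(from = 1, to = 7, by = 1)

# Assign each petal length to a bin; include.lowest = TRUE keeps values
# that sit exactly on the lowest edge (1.0)
bins <- cut(iris$Petal.Length, breaks = breaks, include.lowest = TRUE)

# Count the observations in each bin: these counts are the bar heights
bin_counts <- table(bins)
bin_counts
```

A histogram is simply a bar plot of these counts, with the bins laid out along the x-axis.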

Example: let’s look at the distribution of Petal.Length values

ggplot(data = iris_full, aes(x = Petal.Length)) +
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

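Let’s break down this call to introduce a few key pieces of ggplot2:

  • ggplot: creates the initial canvas and sets the data we’re working with
  • geom: a geometric object, i.e. the type of plot to draw (histogram, points, lines, etc.)
  • aes: the aesthetic mapping, which links variables in the data to visual properties such as x/y position, colour, and shape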
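
To make the roles of these pieces concrete, the same histogram can be built in two steps. This is only a sketch of how the pieces combine; the object names canvas and hist_plot are made up for illustration, and the built-in iris data is used so the block runs on its own.

```r
library(ggplot2)

# Step 1: the canvas. ggplot() stores the data and the aesthetic mapping
# (Petal.Length on the x-axis) but does not add any geometry yet.
canvas <- ggplot(data = iris, aes(x = Petal.Length))

# Step 2: add a geom layer with `+` to say how the data should be drawn.
hist_plot <- canvas + geom_histogram()

# At the console, typing hist_plot (or calling print(hist_plot))
# renders the figure.
```

Storing the canvas in a variable is handy when you want to try several geoms on the same data and mapping.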

Challenge - use ?geom_histogram to determine how to change the number of bins used


Visualizing relationships between two variables

Now, let’s make an x-y scatter plot. Specifically, we would like to plot Sepal.Length on the x-axis against Petal.Length on the y-axis.

ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length )) + 
  geom_point(  )

THAT SEEMS SO COMPLICATED!

It’s true. The syntax for ggplot can seem pretty complicated. But the power of ggplot lies in the ability to layer several geometries (geoms) on top of each other. Also, each geometry has a rich set of options. For example, let’s say we want to recreate the plot we just made, but with each species represented by a different color.

ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species ) ) + 
  geom_point( )

Let’s add more information: how about showing habitat type as well, using the shape of the points?

ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species, shape = Habitat)) + 
  geom_point(size = 2.5 ) +
  theme_bw()
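
One detail in this call is easy to miss: colour and shape sit inside aes(), so they vary with the data, while size = 2.5 sits outside aes() and applies to every point. Here is a small sketch of that distinction using the built-in iris data. layer_data() is used only to peek at the values ggplot2 computed for each point, and the colour "steelblue" is an arbitrary choice.

```r
library(ggplot2)

base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length))

# Mapped: colour is inside aes(), so it depends on the Species column
p_mapped <- base_plot + geom_point(aes(colour = Species), size = 2.5)

# Set: colour is outside aes(), so every point gets the same fixed colour
p_set <- base_plot + geom_point(colour = "steelblue", size = 2.5)

# Number of distinct colours actually assigned to points in each plot
n_colours_mapped <- length(unique(layer_data(p_mapped)$colour))
n_colours_set <- length(unique(layer_data(p_set)$colour))
c(mapped = n_colours_mapped, set = n_colours_set)
```

A common mistake is writing aes(colour = "steelblue"), which treats the string as a data value rather than as a colour name.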

What if we wanted to add trend lines?

ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species)) + 
  geom_point(size = 2.5 ) +
  geom_smooth(method = "lm") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
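
Because colour = Species is set in the top-level aes(), geom_smooth(method = "lm") draws a separate straight line for each species. The following sketch fits the same kind of model directly with lm() so you can look at the intercepts and slopes. It illustrates the idea with base R and is not a look inside ggplot2 itself; the names fits and coef_table are made up for this example.

```r
# Split the data by species and fit Petal.Length ~ Sepal.Length in each group
fits <- lapply(split(iris, iris$Species),
               function(d) coef(lm(Petal.Length ~ Sepal.Length, data = d)))

# One row per species; the columns are the intercept and the slope
coef_table <- do.call(rbind, fits)
coef_table
```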

Facets - a way to separate data into different subplots

Let’s say we wanted different plots for each species. We can do that in ggplot using facets.

ggplot( data = iris_full, aes( x = Sepal.Length, y = Petal.Length ) ) + 
  geom_point() +
  facet_grid( Species ~ . )
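
In facet_grid( Species ~ . ), the variable to the left of the ~ defines the rows, so the three species are stacked vertically. Here is a brief sketch of two alternatives, in case you would rather have the panels side by side: facet_grid( . ~ Species ) puts the species in columns, and facet_wrap( ~ Species ) wraps the panels into a grid. The object names are made up for illustration.

```r
library(ggplot2)

scatter <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# Species on the right of the ~ gives one column per species
by_column <- scatter + facet_grid(. ~ Species)

# facet_wrap() lays the panels out left to right, wrapping as needed
wrapped <- scatter + facet_wrap(~ Species)

# Each point is assigned to a panel; count how many panels are in use
n_panels <- length(unique(layer_data(wrapped)$PANEL))
n_panels
```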


Challenge

  1. ggplot2 has many geometries, allowing us to make lots of different types of plots. Let’s make two new plots. First, a boxplot of Petal.Length, with one box for each species. Use geom_boxplot for this.

ggplot(data = iris_full, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() +
  theme_bw()

  2. Make a histogram of Petal.Length.

ggplot(data = iris_full, aes(x = Petal.Length)) +
  geom_histogram() +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Use facets to separate the three species.

ggplot(data = iris_full, aes(x = Petal.Length, fill = Species)) +
  geom_histogram() +
  facet_grid( Species ~ . ) +
  theme_bw()  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  3. Second, let’s make density plots of Petal.Length. Use geom_density and colour to draw a differently colored density line for each species (which, with our made-up habitat assignments, is the same as one line per habitat type). Note: the area under each curve is equal to 1.

ggplot(data = iris_full, aes(x = Petal.Length, colour = Species)) +
  geom_density() +
  theme_bw()

  4. Use a histogram to plot density instead of counts.

# after_stat(density) replaces the older ..density.. notation,
# which was deprecated in ggplot2 3.4.0
ggplot(data = iris_full, aes(x = Petal.Length)) +
  geom_histogram(aes(y = after_stat(density), fill = Species)) +
  facet_grid( Species ~ . ) +
  geom_density(aes(colour = Species)) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
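
The note that the area under a density curve equals 1 can be checked numerically. This sketch uses base R’s density() with its default settings and the trapezoid rule, applied to the setosa petal lengths. It assumes the base R estimate is a reasonable stand-in for what geom_density draws; the exact curves may differ slightly, so treat the result as an approximation.

```r
# Petal lengths for a single species
setosa_lengths <- iris$Petal.Length[iris$Species == "setosa"]

# Kernel density estimate evaluated on a grid of x values
dens <- density(setosa_lengths)

# Trapezoid rule: width of each grid step times the average height
# of the curve at its two ends, summed over all steps
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
area
```

The result should be very close to 1, which is why density curves for groups of different sizes can be compared on the same axis.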


More ggplot2 resources