The first step to any data analysis is to interrogate your data by calculating some standard statistics and by visualizing it in various ways. In this section, we’ll focus on data visualization.
ggplot2 library
For the next section, we’re going to continue to work with the iris dataset.
Before continuing with visualization, let’s add a habitat type variable to the iris data set. We’ll use this later. Caveat - I made up these habitat type preferences.
iris_habitat <- data.frame( Species = c( "setosa", "versicolor", "virginica" ),
Habitat = c( "forest", "wetland", "meadow" ) )
iris_full <- merge( x = iris, y = iris_habitat, by = "Species" )
head( iris_full )
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 1 setosa 5.1 3.5 1.4 0.2 forest
## 2 setosa 4.9 3.0 1.4 0.2 forest
## 3 setosa 4.7 3.2 1.3 0.2 forest
## 4 setosa 4.6 3.1 1.5 0.2 forest
## 5 setosa 5.0 3.6 1.4 0.2 forest
## 6 setosa 5.4 3.9 1.7 0.4 forest
tail( iris_full )
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width Habitat
## 145 virginica 6.7 3.3 5.7 2.5 meadow
## 146 virginica 6.7 3.0 5.2 2.3 meadow
## 147 virginica 6.3 2.5 5.0 1.9 meadow
## 148 virginica 6.5 3.0 5.2 2.0 meadow
## 149 virginica 6.2 3.4 5.4 2.3 meadow
## 150 virginica 5.9 3.0 5.1 1.8 meadow
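Because merge can silently drop or duplicate rows when the keys do not line up, a quick sanity check is worthwhile. The sketch below rebuilds iris_full exactly as above, then confirms the row count and the species-to-habitat pairing. The name habitat_counts is introduced here for illustration only.

```r
# Rebuild the merged data set so this block runs on its own
iris_habitat <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                           Habitat = c("forest", "wetland", "meadow"))
iris_full <- merge(x = iris, y = iris_habitat, by = "Species")

# The merge should keep all 150 rows and add exactly one column (Habitat)
dim(iris_full)

# Cross-tabulate species against habitat: each species should appear
# under exactly one habitat type
habitat_counts <- table(iris_full$Species, iris_full$Habitat)
habitat_counts
```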
ggplot2 package
I’m going to introduce a data visualization package called ggplot2. This package is great for producing publication-quality graphics, but the syntax used to create a plot is a little more involved than base R (i.e., the graphics package).
First, we need to install the ggplot2 package:
# Only need to do this once
install.packages("ggplot2")
Then load it:
library(ggplot2)
NOTE: You only have to install a new package once, but you need to call the library function at the beginning of every new R session.
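One way to avoid re-installing by accident is to install only when the package is missing. This is a minimal sketch of that pattern using base R's requireNamespace. It is a convenience and not something the steps above require.

```r
# Install ggplot2 only if it is not already available on this machine
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading is still needed in every new R session
library(ggplot2)
```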
Perhaps the most common way to look at data for a single variable is a histogram. This essentially is a bar plot, where each bar represents the number of times a value falls within a particular bin.
Example: let’s look at the distribution of Petal.Length values.
ggplot(data = iris_full, aes(x = Petal.Length)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
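The message printed with the histogram suggests choosing the binning yourself. Here is a minimal sketch of the two usual options, bins (a count) and binwidth (a width in the units of the x variable). The values 20 and 0.25 are arbitrary choices for illustration, and the built-in iris data is used so the block runs on its own (iris_full works the same way).

```r
library(ggplot2)

# Option 1: ask for a fixed number of bins
p_bins <- ggplot(data = iris, aes(x = Petal.Length)) +
  geom_histogram(bins = 20)

# Option 2: ask for a fixed bin width (here 0.25 cm)
p_width <- ggplot(data = iris, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.25)

# Print either object to draw it
p_bins
p_width
```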
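Let’s break down this histogram call to introduce a few key things about ggplot. The ggplot function takes the data frame and, inside aes, the mapping from variables to plot aesthetics (here, Petal.Length on the x axis); each geom added with + then draws a layer from that mapping. See ?geom_histogram to determine how to change the number of bins used.
Now, let’s make an x-y scatter plot. Specifically, we would like to plot Sepal.Length on the x axis against Petal.Length on the y axis.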
ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length )) +
geom_point( )
THAT SEEMS SO COMPLICATED!
It’s true. The syntax for ggplot can seem pretty complicated. But the power of ggplot lies in the ability to lay several geometries (geoms) over each other. Also, each geometry has a rich set of options. For example, let’s say I want to create the plot we just made, but have each species represented by a different color.
ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species ) ) +
geom_point( )
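Because a plot is built by adding layers, it can also be stored in an object and extended piece by piece. The sketch below shows that idea. The object names (base_plot, scatter, scatter_by_species) are made up for illustration, and the built-in iris data is used so the block is self-contained.

```r
library(ggplot2)

# Start with the data and the x/y mapping only; no geometry yet
base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length))

# Add a point layer to get the plain scatter plot
scatter <- base_plot + geom_point()

# Reuse the same base, this time mapping colour to Species inside the geom
scatter_by_species <- base_plot + geom_point(aes(colour = Species))

# Print an object to draw it
scatter_by_species
```

Storing the base plot this way makes it easy to try different geoms without retyping the data and aesthetic mapping.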
Let’s add more information: how about habitat type as well?
ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species, shape = Habitat)) +
geom_point(size = 2.5 ) +
theme_bw()
What if we wanted to add trend lines?
ggplot(data = iris_full, aes( x = Sepal.Length, y = Petal.Length, colour = Species)) +
geom_point(size = 2.5 ) +
geom_smooth(method = "lm") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
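With colour mapped to Species, geom_smooth(method = "lm") fits a separate straight line for each species. As a rough cross-check, the same per-species regressions can be fitted directly with base R's lm. This is only a sketch, and the name species_slopes is introduced here for illustration.

```r
# Fit Petal.Length ~ Sepal.Length separately within each species
# and keep the slope of each fitted line
species_slopes <- sapply(
  split(iris, iris$Species),
  function(d) coef(lm(Petal.Length ~ Sepal.Length, data = d))[["Sepal.Length"]]
)

# One slope per species, named by species
species_slopes
```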
facets - a way to separate data into different subplots
Let’s say we wanted different plots for each species. We can do that in ggplot using facets.
ggplot( data = iris_full, aes( x = Sepal.Length, y = Petal.Length ) ) +
geom_point() +
facet_grid( Species ~ . )
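The formula passed to facet_grid controls the layout: variables on the left of ~ define rows, and variables on the right define columns. The sketch below shows the row layout used above next to two alternatives, a column layout and facet_wrap. The object names are illustrative only.

```r
library(ggplot2)

base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# One row per species (as in the example above)
p_rows <- base_plot + facet_grid(Species ~ .)

# One column per species
p_cols <- base_plot + facet_grid(. ~ Species)

# facet_wrap lays the panels out in a ribbon that wraps as needed
p_wrap <- base_plot + facet_wrap(~ Species)

p_cols
```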
ggplot2 has many geometries, allowing us to make lots of different types of plots. Let’s make two new plots: a boxplot of Petal.Length, with one boxplot for each species (use geom_boxplot for this), and a histogram of Petal.Length (use geom_histogram).
ggplot(data = iris_full, aes(x = Species, y = Petal.Length, fill = Species)) +
geom_boxplot() +
theme_bw()
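Each box summarises a group by its median and its first and third quartiles. Those numbers can also be computed directly, as in this minimal sketch with base R's quantile. The name petal_quartiles is hypothetical.

```r
# 25th, 50th (median) and 75th percentiles of Petal.Length per species;
# the result is a matrix with one column per species
petal_quartiles <- sapply(
  split(iris$Petal.Length, iris$Species),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

petal_quartiles
```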
ggplot(data = iris_full, aes(x = Petal.Length)) +
geom_histogram() +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use facets to separate the three species.
ggplot(data = iris_full, aes(x = Petal.Length, fill = Species)) +
geom_histogram() +
facet_grid( Species ~ . ) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use geom_density and colour to make different colored density lines for each iris species. Note: the area under each curve is equal to 1.
ggplot(data = iris_full, aes(x = Petal.Length, colour = Species)) +
geom_density() +
theme_bw()
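The claim that a density curve encloses an area of 1 can be checked numerically. The sketch below uses base R's density function (a kernel density estimate, computed outside ggplot2) and the trapezoid rule. The result is only approximately 1, because the curve is evaluated on a finite grid.

```r
# Kernel density estimate of Petal.Length for one species
setosa_density <- density(iris$Petal.Length[iris$Species == "setosa"])

# Trapezoid rule: sum of (width of each interval) * (average height)
x <- setosa_density$x
y <- setosa_density$y
area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area
```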
We can also overlay each density curve on a density-scaled histogram, faceted by species.
ggplot(data = iris_full, aes(x = Petal.Length)) +
geom_histogram(aes(y = after_stat(density), fill = Species)) +
facet_grid( Species ~ . ) +
geom_density(aes(colour = Species)) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot2 resources