Section Goals


Before starting this section, it’s a good idea to restart RStudio and clear your environment if there’s anything in it.

Variables and objects

There are several basic types of data structures in R.

Functions that are useful for understanding the different types of data structures

These functions will tell you what kind of variable you are dealing with, as well as some additional information which may be useful as you advance in your use of R.

str()
class()
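
As a quick illustration, here is a minimal sketch of what these two functions report for a few simple values. The variable names (`a_number`, `some_text`, `a_flag`) are made up for this example and are not used elsewhere in the lesson.

```r
# Hypothetical example values (names chosen for illustration only)
a_number <- 8
some_text <- "cat"
a_flag <- TRUE

# class() returns the type of an object as a character string
class(a_number)   # "numeric"
class(some_text)  # "character"
class(a_flag)     # "logical"

# str() prints a compact description: the type, followed by the value(s)
str(a_number)
str(c(10, 2, 8))
```

Try running str() and class() on each new object you create in the rest of this section.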

Practice with variables

Variables (aka Objects)

Define a variable

my_var <- 8

And another

my_var2 <- 10

Work with variables

my_var + my_var2
## [1] 18

Make a new variable

my_var_tot <- my_var + my_var2

Challenge

Change the value of my_var2

my_var2 <- 3

What is the value of my_var_tot now?


Vectors

Define a variable

my_var <- 8

And another

my_var2 <- 10

Combining values into a vector

# Vector of variables
my_vect <- c(my_var, my_var2)

Here are some other examples of vectors

# Numeric vector
v1 <- c(10, 2, 8, 7, 11, 15)

# Character vector
pets <- c("cat", "dog", "rabbit", "pig")

Making a vector of numbers in sequence

v2 <- 1:10
v3 <- seq(from = 1, to = 10)

Challenge

  1. Look up the help for the seq function, and use this to make a vector from 1 to 100, by steps of 5.
  2. Come up with a way that you would use the length.out argument in the seq function.

Exploring variable elements

You can extract specific elements from vectors and other data structures.

Introduction to the square brackets []

Using the square bracket notation is sometimes referred to as indexing, or index notation.

pets <- c("cat", "dog", "rabbit", "pig", "snake")

  • Getting a single element

pets[1]
## [1] "cat"

  • Getting a number of elements, in sequence

pets[3:4]
## [1] "rabbit" "pig"

  • Getting a number of elements, not in sequence

pets[c(1,4)]
## [1] "cat" "pig"
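
The same index notation works on numeric vectors, and the extracted elements can be used directly in calculations. The sketch below reuses the v1 vector defined earlier; length() and sum() are base R functions that have not been introduced above and are used here only for illustration.

```r
# The numeric vector defined earlier in this section
v1 <- c(10, 2, 8, 7, 11, 15)

length(v1)        # number of elements: 6
v1[length(v1)]    # last element: 15
v1[2:4]           # elements 2 through 4: 2 8 7
sum(v1[2:4])      # indexing result used in a calculation: 17
v1[c(1, 6)]       # first and last elements: 10 15
```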

Working with matrices

Review - Why might we want 2D data?

Let’s make a matrix


Challenge

With a partner, break down this function, and describe each argument. What is the final product?

my_mat <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

What does it mean to fill byrow?

matrix(data = 1:9, nrow = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Versus

matrix(data = 1:9, nrow = 3, byrow = FALSE)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Challenge

What is the default value for byrow?


Indexing matrices

Matrices are indexed using row, column notation: the first index selects rows and the second selects columns.

my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)

my_mat[1,1]
## [1] 1
my_mat[1,2]
## [1] 2
my_mat[2,1]
## [1] 6
my_mat[1:4, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    6    7    8
## [3,]   11   12   13
## [4,]   16   17   18
my_mat[c(1,3,5), ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]   11   12   13   14   15
## [3,]   21   22   23   24   25
my_mat[ ,c(1,3,5)]
##       [,1] [,2] [,3]
##  [1,]    1    3    5
##  [2,]    6    8   10
##  [3,]   11   13   15
##  [4,]   16   18   20
##  [5,]   21   23   25
##  [6,]   26   28   30
##  [7,]   31   33   35
##  [8,]   36   38   40
##  [9,]   41   43   45
## [10,]   46   48   50
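
Note that the call above only specified nrow; R worked out the number of columns from the length of the data. A short sketch to confirm the shape of the matrix, using the base R functions dim(), nrow(), and ncol() (not introduced above):

```r
# Same matrix as above: 50 values filled row by row into 10 rows
my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)

dim(my_mat)    # rows, then columns: 10 5
nrow(my_mat)   # 10
ncol(my_mat)   # 5

# The last row and last column hold the last value that was filled in
my_mat[nrow(my_mat), ncol(my_mat)]   # 50
```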

Using built-in functions on matrices

Make a matrix of pseudo-random values. Because we call set.seed first, the same values are generated every time, so the results below are reproducible.

set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

Calculate the mean of all of the data

mean(mat1)
## [1] 0.5325929

Calculate the standard deviation of all of the data

sd(mat1)
## [1] 0.272239

Calculate row means and column means

rowMeans(mat1)
##  [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
##  [8] 0.5411271 0.6667390 0.5429926
colMeans(mat1)
## [1] 0.5949241 0.4500704 0.5808290 0.5491985 0.4879423

Introduce the apply function

apply(mat1, MARGIN = 1, mean)
##  [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
##  [8] 0.5411271 0.6667390 0.5429926
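
In the call above, MARGIN = 1 tells apply to work over rows, which is why the result matches rowMeans(mat1). The sketch below uses a small made-up matrix (`small_mat`, not part of the lesson data) so the numbers are easy to check by hand. It shows that MARGIN = 2 works over columns instead, and that other functions, such as max, can be supplied in place of mean.

```r
# Hypothetical 2 x 3 matrix, filled by row:
#   1 2 3
#   4 5 6
small_mat <- matrix(data = 1:6, nrow = 2, byrow = TRUE)

# MARGIN = 1: apply the function to each row
row_means <- apply(small_mat, MARGIN = 1, mean)   # 2 5

# MARGIN = 2: apply the function to each column
col_means <- apply(small_mat, MARGIN = 2, mean)   # 2.5 3.5 4.5

# These agree with the dedicated helper functions
all.equal(row_means, rowMeans(small_mat))   # TRUE
all.equal(col_means, colMeans(small_mat))   # TRUE

# Any function that takes a vector can be used, e.g. the maximum of each row
row_max <- apply(small_mat, MARGIN = 1, max)      # 3 6
```

rowMeans and colMeans only compute means; apply is the more general tool when you need a different summary for each row or column.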

Data frames

Iris dataset

We’re going to work with a dataset that comes built in to R, commonly called the iris dataset. It is also sometimes called Fisher’s Iris dataset (but should more appropriately be called Anderson’s Iris dataset). Because it comes pre-packaged with R, we can load it into our environment using the data function.

data(iris)

Let’s have a look at this dataset

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

The dataset contains measurements of four characteristics of three different species of Iris (plants!).
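
The str() and class() functions from the start of this section are a handy way to confirm this description. A brief sketch follows; dim(), names(), and levels() are base R functions not introduced above, and the $ operator used in the last line is explained later in this section.

```r
data(iris)

class(iris)            # "data.frame"
dim(iris)              # 150 rows, 5 columns
names(iris)            # the five column names

# str() lists each column with its type and its first few values
str(iris)

# The three species are stored as the levels of a factor
levels(iris$Species)   # "setosa" "versicolor" "virginica"
```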

summary function

Let’s begin by using the summary function to examine this dataset. For numeric columns, summary returns the minimum, quartiles, median, mean, and maximum; for factor columns, it returns the count of each level. When doing data exploration, a few things you want to look at are:

  • How do the mean and median values within a variable compare?
  • Do the min and max values suggest there are outliers?
  • Which variables (i.e., columns) are quantitative (numeric) versus categorical (factors or characters)?

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Aside: Differences between characters and factors

Factors look similar to characters, but R stores them internally as integer codes, with a text label (called a level) for each code. Think of factors as categories. Here’s a quick example of how the two variable types differ when using summary.

## Make a new iris dataset
iris_new <- iris

## Create a new column that treats species as a character, rather than a factor
iris_new$species_char <- as.character(iris_new$Species)

## Run summary command
summary(iris_new)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   species_char      
##  setosa    :50   Length:150        
##  versicolor:50   Class :character  
##  virginica :50   Mode  :character  
##                                    
##                                    
## 
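
To see the numeric coding directly, here is a minimal sketch using a small made-up vector (`pets_char` and `pets_fact` are hypothetical names for this example only). By default, factor() assigns the levels in alphabetical order.

```r
# A character vector with a repeated value
pets_char <- c("cat", "dog", "cat", "rabbit")

# The same values stored as a factor
pets_fact <- factor(pets_char)

levels(pets_fact)       # the categories: "cat" "dog" "rabbit"
as.integer(pets_fact)   # the underlying integer codes: 1 2 1 3

# summary() counts each level of a factor ...
summary(pets_fact)      # cat 2, dog 1, rabbit 1

# ... but only reports length, class, and mode for a character vector
summary(pets_char)
```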

A (very) brief introduction to navigating a data.frame

We will be very brief here. I recommend checking out this Data Carpentry lesson for more information.

  • Looking at specific data.frame elements. Use the row and column notation.

Here is the 5th row, 3rd column (Petal.Length). Note: We are using square brackets to index the data.frame and we always use row, column notation.

iris[5, 3]
## [1] 1.4
  • Looking at an entire column.

Here are two ways to get the Petal.Length column.

First, use index notation. Note that we leave the row part blank, but still include the comma.

iris[ ,3]
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
##  [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
##  [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
##  [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
##  [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
##  [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1

Second, use only the variable (column) name. Note the use of the $ operator.

iris$Petal.Length
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
##  [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
##  [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
##  [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
##  [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
##  [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1
  • Looking at a specific column entry.

This is another way to look at the 5th entry in the Petal.Length column.

iris$Petal.Length[5]
## [1] 1.4
  • Looking at all entries for a given row.

Here are all the entries for the 5th row. Note: here we leave the column part blank, but still add the comma.

iris[5, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5            5         3.6          1.4         0.2  setosa
  • Looking at a set of rows and/or columns.

Here are all the entries in the 5th through 10th rows, 1st through 3rd columns. Note: we use the : operator to look at a range of values.

iris[5:10, 1:3]
##    Sepal.Length Sepal.Width Petal.Length
## 5           5.0         3.6          1.4
## 6           5.4         3.9          1.7
## 7           4.6         3.4          1.4
## 8           5.0         3.4          1.5
## 9           4.4         2.9          1.4
## 10          4.9         3.1          1.5
  • For data.frames, if you supply a single index without a comma (i.e., not row, column notation), the index selects columns, and the result is a data.frame containing only those columns.

head(iris[2:3])
##   Sepal.Width Petal.Length
## 1         3.5          1.4
## 2         3.0          1.4
## 3         3.2          1.3
## 4         3.1          1.5
## 5         3.6          1.4
## 6         3.9          1.7
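
The two notations also return different kinds of objects, which class() makes visible. A short sketch, assuming only the built-in iris data (`one_index` and `row_col` are names made up for this example):

```r
data(iris)

# Single index, no comma: a data.frame with one column
one_index <- iris[3]

# Row, column notation with the row part blank: a plain numeric vector
row_col <- iris[, 3]

class(one_index)    # "data.frame"
class(row_col)      # "numeric"

dim(one_index)      # 150 rows, 1 column
length(row_col)     # 150 values
```

This matters when passing a column to a function such as mean, which expects a vector rather than a data.frame.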

Challenge

What am I going to get if I execute the command below?

head(iris[c("Sepal.Width","Petal.Length")])

Reading in your own data

One of the most basic things you will need to do in R is read in your own data set. You can read in Excel files, simple text files, and even files from Google Sheets, but the easiest type of file to read in is a comma-separated values (CSV) file. You can save an Excel workbook (or Numbers workbook or Google Sheet) as a CSV file by using the “Save as …” menu item.

Let’s read in the soil data from past activities using the function read.csv. Note that in this example, we are reading this file using a URL that points to it, but you can just as easily read in a file on your local system (in fact, this is much more common!).

soil_data <- read.csv("https://mlammens.github.io/ESA-2021-BEDE-Network/docs/soil_nutrient_data.csv")
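
Reading a local file works the same way, with a file path in place of the URL. The sketch below is self-contained: it writes a small made-up data.frame (`example_data`, which is not the soil data) to a temporary CSV file and reads it back. With your own data, you would instead pass the path to your file on disk.

```r
# Made-up example data (not the soil data from the lesson)
example_data <- data.frame(
  site = c("site_a", "site_b", "site_c"),
  replicate = 1:3,
  pH = c(6.5, 7.1, 6.2)
)

# Write it to a temporary CSV file; row.names = FALSE avoids an extra column
example_path <- tempfile(fileext = ".csv")
write.csv(example_data, file = example_path, row.names = FALSE)

# Read it back in with read.csv, giving a file path instead of a URL.
# stringsAsFactors = FALSE keeps text columns as characters in all R versions.
example_read <- read.csv(example_path, stringsAsFactors = FALSE)

head(example_read)
```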

Let’s have a brief look at these data.

head(soil_data)
##      date condition      site replicate  pH nitrate_lbacre phosphorus_lbacre
## 1 3/29/18       wet townhouse         1 6.5             10              10.0
## 2 3/29/18       wet townhouse         2 7.0             10              10.0
## 3 3/29/18       wet townhouse         3 6.2             10              17.5
## 4 3/29/18       wet townhouse         4 7.1             10              25.0
## 5 3/29/18       wet townhouse         5 6.6             10              25.0
## 6 3/29/18       wet   osa_top         1 7.0             10              25.0
tail(soil_data)
##       date condition        site replicate  pH nitrate_lbacre phosphorus_lbacre
## 20 3/29/18       wet wetland_top         5 6.2             10                10
## 21 3/29/18       wet wetland_bot         1 6.8             15                10
## 22 3/29/18       wet wetland_bot         2 6.3             10                10
## 23 3/29/18       wet wetland_bot         3 6.4              5                10
## 24 3/29/18       wet wetland_bot         4 6.4             10                10
## 25 3/29/18       wet wetland_bot         5 6.7             10                 5

These data include nutrient measurements from several sites on a suburban college campus. The data set also records where each sample was collected and the condition of the site at the time of collection.

Let’s begin by using the summary function to examine this data set. As above, let’s look for the following:

  • How do the mean and median values within a variable compare?
  • Do the min and max values suggest there are outliers?
  • Which variables (i.e., columns) are quantitative (numeric) versus categorical (factors or characters)?

summary(soil_data)
##      date            condition             site             replicate
##  Length:25          Length:25          Length:25          Min.   :1  
##  Class :character   Class :character   Class :character   1st Qu.:2  
##  Mode  :character   Mode  :character   Mode  :character   Median :3  
##                                                           Mean   :3  
##                                                           3rd Qu.:4  
##                                                           Max.   :5  
##        pH        nitrate_lbacre phosphorus_lbacre
##  Min.   :6.000   Min.   : 5.0   Min.   : 5.0     
##  1st Qu.:6.200   1st Qu.: 5.0   1st Qu.: 5.0     
##  Median :6.600   Median :10.0   Median :10.0     
##  Mean   :6.564   Mean   : 9.4   Mean   :12.3     
##  3rd Qu.:6.800   3rd Qu.:10.0   3rd Qu.:17.5     
##  Max.   :7.200   Max.   :20.0   Max.   :25.0
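
In the summary above, the date, condition, and site columns are reported as characters, so no counts are shown for them. One possible way to check column types and count the entries in a character column is sketched below. To keep the example self-contained, it uses a small made-up stand-in (`soil_example`) rather than the real soil_data; sapply() and table() are base R functions not introduced above.

```r
# Made-up stand-in with one character column and one numeric column
soil_example <- data.frame(
  site = c("townhouse", "townhouse", "osa_top", "wetland_top"),
  pH = c(6.5, 7.0, 7.0, 6.2),
  stringsAsFactors = FALSE
)

# Report the class of every column at once
col_types <- sapply(soil_example, class)
col_types    # site: "character", pH: "numeric"

# Count how many rows there are for each site
site_counts <- table(soil_example$site)
site_counts  # osa_top 1, townhouse 2, wetland_top 1
```

The same two calls could be run on soil_data itself once it has been read in.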