Learning Outcomes

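At the end of this session you should be able to:

  • Perform basic arithmetic using R/RStudio
  • Identify the major components of the RStudio interface
  • Create an R script file and use it to document a sequence of R functions
  • Read in a data set and perform basic data summary procedures in R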

Getting started

For this session, you can either work in RStudio on your own computer (if you have RStudio installed) or you can work in an RStudio Cloud session. To start an RStudio Cloud session go to https://rstudio.cloud/ and log in using your GitHub account.

Difference between R and RStudio

In this session, we will be working primarily in RStudio. So what is the difference between R and RStudio? R is a programming language (specifically a statistical analysis programming language), while RStudio is an Integrated Development Environment, or IDE. You can think of it as an interface to R that aids the analyst. RStudio offers a number of features, mostly related to the visual presentation of information, that make writing and working with R code easier.

Overall layout

There are four panels in the RStudio interface (though you may only have three open when you first start it), each of which holds valuable information.

  • Console / Terminal panel (lower-left)
  • Environment / History / Git (upper-right)
  • Files / Plots / Packages / Help (lower-right)
  • Source / Editor (upper-left)

File management

Good file management is important in data science. We’ll start working on this now.

  • Make an ESA-2019-DataSci directory / folder
  • Make a data directory / folder within ESA-2019-DataSci
  • Make a scripts directory / folder within ESA-2019-DataSci
  • In RStudio, set working directory to your new ESA-2019-DataSci directory

Setting your working directory

Point-and-click method - Use ‘Session’ > ‘Set Working Directory’ > ‘Choose Directory’.

Using the R Console:

setwd("/Users/maiellolammens/ESA-2019-DataSci/")
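The folders described under File management can also be created from the R console. The sketch below is one way to do it. It assumes a temporary location (tempdir()) stands in for wherever you keep your work, and the names project_dir and old_wd are illustrative only.

```r
# Build the project path under a temporary folder (an assumption for this
# sketch; substitute the location where you actually keep your work).
project_dir <- file.path(tempdir(), "ESA-2019-DataSci")

# Create the project folder and its two sub-folders.
dir.create(file.path(project_dir, "data"), recursive = TRUE)
dir.create(file.path(project_dir, "scripts"), recursive = TRUE)

# setwd() returns the previous working directory, so we can go back later.
old_wd <- setwd(project_dir)

# List what is inside the new working directory.
list.files()

# Return to where we started.
setwd(old_wd)
```

Using file.path() to join folder names avoids typing the path separator by hand, which differs between Windows and macOS/Linux.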

Making an R Project

Let’s make a new R Project associated with your ESA-2019-DataSci directory. It will become clearer why we are doing this later, when we get to the Git lesson. To make a new project, go to the upper right-hand side of the RStudio interface, where it says Project: (None). Click the little downward arrow, select “New Project”, then select “Existing Directory” from the window that pops up. Use the graphical user interface (GUI) to navigate to the ESA-2019-DataSci directory, then select “Create Project”.

Getting help

  • Help panel (lower right corner)
  • help.search
help.search("bar plot")

Challenge

Use the help.search function to search for something in statistics that you think should be in R. Did you find anything?

  • I know my function - just give me the details - ?barplot

R as calculator

We can use R just like any other calculator.

3 + 5
## [1] 8

R follows the standard order of operations (PEMDAS, or “Please Excuse My Dear Aunt Sally”), so multiplication is carried out before addition.

(3 * 5) + 7
## [1] 22
3 * 5 + 7
## [1] 22

Challenge

Write an example where adding parentheses matters.

Internal functions

There are a ton of internal functions, and a lot of add-ons.

sqrt(4)
## [1] 2
abs(-5)
## [1] 5
sqrt(-5)
## Warning in sqrt(-5): NaNs produced
## [1] NaN
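The NaN above is R's way of saying that the result is not a real number. A small sketch of how to check for that value, and two possible ways around it, using only base R functions (the variable name x is illustrative):

```r
# sqrt() of a negative real number gives NaN (with a warning).
x <- suppressWarnings(sqrt(-5))
is.nan(x)
## [1] TRUE

# Taking the absolute value first keeps the input non-negative.
sqrt(abs(-4))
## [1] 2

# Complex numbers do have square roots of negative values.
sqrt(as.complex(-4))
## [1] 0+2i
```

Which option is appropriate depends on what the negative value means in your data; often a NaN is a useful signal that something upstream needs checking.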

R script file

Use a script file for your work. It’s easier to go back to and easy to document.

Important: within an R file, you can use the # sign to add comments. Anything written after the # is not interpreted when you run the code.

Challenge

Create a new R script file in your scripts directory.

Basic file management in R

# What working directory am I in?
getwd()
## [1] "/Users/maiellolammens/Google Drive/Professional/Workshops-and-Short-Courses-Taught/ESA-2019-BEDENet/docs"
# Move to a different directory ("." means the current directory, so this call stays where it is)
setwd(".")

Additional file management points

  • Navigating the file path
  • Tab completion of file paths
  • Tab completion of R commands

Challenge

  • Try to auto-complete fil. What do you find?
  • Use the brief help menu that comes up to find a function that starts with file, and describe what you think it does.

Variables and objects

There are several basic types of data structures in R.

Functions that are useful for understanding the different types of data structures

These functions will tell you what kind of variable you are dealing with, as well as some additional information which may be useful as you advance in your use of R.

str()
class()
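As a quick illustration of what these two functions report, here is a minimal sketch that calls them on a few simple values. The values themselves are arbitrary examples.

```r
# A single number is a numeric vector of length one.
class(8)
## [1] "numeric"

# The colon operator produces integers.
class(1:10)
## [1] "integer"

# Text is stored as character.
class("cat")
## [1] "character"

# str() shows the type, the length, and the first few values.
str(c(10, 2, 8))
##  num [1:3] 10 2 8
```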

Practice with variables

Let’s define a variable.

my_var <- 8

And another

my_var2 <- 10

Work with vars

my_var + my_var2
## [1] 18

Make a new variable

my_var_tot <- my_var + my_var2

Challenge

Change the value of my_var2

my_var2 <- 3

What is the value of my_var_tot now?

Make a vector

Let’s combine multiple values into a vector of length greater than 1.

# Vector of variables
my_vect <- c(my_var, my_var2)

# Numeric vector
v1 <- c(10, 2, 8, 7, 11, 15)

# Char vector
pets <- c("cat", "dog", "rabbit", "pig")

Making a vector of numbers in sequence

v2 <- 1:10
v3 <- seq(from = 1, to = 10)

Challenge

  1. Look up the help for the seq function, and use this to make a vector from 1 to 100, by steps of 5.
  2. Come up with a way that you would use the length.out argument.

Exploring variable elements

You can get specific elements from vectors and other data structures

pets <- c("cat", "dog", "rabbit", "pig", "snake")
pets[1]
## [1] "cat"
pets[3:4]
## [1] "rabbit" "pig"
pets[c(1,4)]
## [1] "cat" "pig"
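Positions are not the only way to pick elements. The sketch below shows two other common patterns, dropping an element with a negative index and keeping elements that pass a logical test, using the pets and v1 vectors defined earlier in the lesson.

```r
pets <- c("cat", "dog", "rabbit", "pig", "snake")

# A negative index drops that element.
pets[-1]
## [1] "dog"    "rabbit" "pig"    "snake"

# A logical test keeps only the elements where the test is TRUE.
v1 <- c(10, 2, 8, 7, 11, 15)
v1[v1 > 8]
## [1] 10 11 15

# length() tells you how many elements a vector holds.
length(pets)
## [1] 5
```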

Working with matrices

Review - Why might we want 2D data?

Let’s make a matrix

Challenge

With the people next to you, break down this function, and describe each argument. What is the final product?

my_mat <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

What does it mean to fill byrow?

matrix(data = 1:9, nrow = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Versus

matrix(data = 1:9, nrow = 3, byrow = FALSE)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Challenge

What is the default value for byrow?

Indexing matrices

Indexing happens by row, column notation.

my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)

my_mat[1,1]
## [1] 1
my_mat[1,2]
## [1] 2
my_mat[2,1]
## [1] 6
my_mat[1:4, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    6    7    8
## [3,]   11   12   13
## [4,]   16   17   18
my_mat[c(1,3,5), ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]   11   12   13   14   15
## [3,]   21   22   23   24   25
my_mat[ ,c(1,3,5)]
##       [,1] [,2] [,3]
##  [1,]    1    3    5
##  [2,]    6    8   10
##  [3,]   11   13   15
##  [4,]   16   18   20
##  [5,]   21   23   25
##  [6,]   26   28   30
##  [7,]   31   33   35
##  [8,]   36   38   40
##  [9,]   41   43   45
## [10,]   46   48   50
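Two details are worth knowing when indexing matrices: how to check the dimensions, and what happens to the shape when only one row is selected. A minimal sketch, reusing the my_mat matrix from above:

```r
my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)

# dim() returns the number of rows, then the number of columns.
dim(my_mat)
## [1] 10  5

# Selecting a single row returns a plain vector ...
my_mat[2, ]
## [1]  6  7  8  9 10

# ... unless drop = FALSE is used to keep the matrix shape.
dim(my_mat[2, , drop = FALSE])
## [1] 1 5
```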

Combining internal functions with matrices

Make a “random” matrix (that isn’t random in this case because of the set.seed function)

set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

Calculate the mean of all of the data

mean(mat1)
## [1] 0.5325929

Calculate the standard deviation of all of the data

sd(mat1)
## [1] 0.272239

Calculate row means and column means

rowMeans(mat1)
##  [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
##  [8] 0.5411271 0.6667390 0.5429926
colMeans(mat1)
## [1] 0.5949241 0.4500704 0.5808290 0.5491985 0.4879423

Introduce the apply function

apply(mat1, MARGIN = 1, mean)
##  [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
##  [8] 0.5411271 0.6667390 0.5429926
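The MARGIN argument controls the direction in which apply works: 1 means rows, 2 means columns. The sketch below recreates mat1 and shows the column version, plus a case where there is no built-in shortcut such as rowMeans. The names col_means and row_sds are illustrative.

```r
set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

# MARGIN = 2 applies the function to each column instead of each row.
col_means <- apply(mat1, MARGIN = 2, mean)

# The result matches the built-in colMeans() function.
all.equal(col_means, colMeans(mat1))
## [1] TRUE

# apply() also accepts functions that have no row/column shortcut,
# such as the standard deviation of each row.
row_sds <- apply(mat1, MARGIN = 1, sd)
length(row_sds)
## [1] 10
```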

Data frames

Reading in your own data

One of the most basic things you will need to do in R is read in your own data set. You can read in Excel files, simple text files, and even files from Google Sheets. But the easiest type of file to read in is a comma separated values (CSV) file. You can save an Excel workbook (or Numbers workbook or Google Sheet) as a CSV file by using the “Save as …” menu item.
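As a minimal sketch of reading a CSV file stored on your own computer, the example below first writes a tiny, made-up data set to a temporary file and then reads it back. The file location and the names example_file, example_data, and my_data are assumptions for illustration, not part of the workshop materials.

```r
# Write a tiny, made-up data set to a temporary CSV file so that this
# example does not depend on any file on your computer.
example_file <- tempfile(fileext = ".csv")
example_data <- data.frame(site = c("a", "b", "c"), pH = c(6.5, 7.0, 6.2))
write.csv(example_data, file = example_file, row.names = FALSE)

# Read the file back in; read.csv() returns a data frame.
my_data <- read.csv(example_file)

# Check the size and the column names.
dim(my_data)
## [1] 3 2
names(my_data)
## [1] "site" "pH"
```

With a real file, you would replace example_file with the path to your own CSV, for example a file inside the data folder of your project.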

Let’s read in the soil data from our Speed Data Science activity.

soil_data <- read.csv("https://mlammens.github.io/ESA2019-BEDENet/docs/soil_nutrient_data.csv")

Let’s have a brief look at this data set.

head(soil_data)
##      date condition      site replicate  pH nitrate_lbacre
## 1 3/29/18       wet townhouse         1 6.5             10
## 2 3/29/18       wet townhouse         2 7.0             10
## 3 3/29/18       wet townhouse         3 6.2             10
## 4 3/29/18       wet townhouse         4 7.1             10
## 5 3/29/18       wet townhouse         5 6.6             10
## 6 3/29/18       wet   osa_top         1 7.0             10
##   phosphorus_lbacre
## 1              10.0
## 2              10.0
## 3              17.5
## 4              25.0
## 5              25.0
## 6              25.0
tail(soil_data)
##       date condition        site replicate  pH nitrate_lbacre
## 20 3/29/18       wet wetland_top         5 6.2             10
## 21 3/29/18       wet wetland_bot         1 6.8             15
## 22 3/29/18       wet wetland_bot         2 6.3             10
## 23 3/29/18       wet wetland_bot         3 6.4              5
## 24 3/29/18       wet wetland_bot         4 6.4             10
## 25 3/29/18       wet wetland_bot         5 6.7             10
##    phosphorus_lbacre
## 20                10
## 21                10
## 22                10
## 23                10
## 24                10
## 25                 5

As we saw in our previous activity, these data include nutrient measurements at multiple sites on a suburban campus. The data set also includes information about the date the samples were collected and the conditions of the site during collection.

summary function

Let’s begin by using the summary function to examine this data set. summary returns many of the standard statistics. When doing data exploration, a few things you want to look at are:

  • How do the mean and median values within a variable compare?
  • Do the min and max values suggest there are outliers?
  • Which variables (i.e., columns) are quantitative (numeric) versus categorical (factors or characters)?

summary(soil_data)
##       date    condition          site     replicate       pH       
##  3/29/18:25   wet:25    osa_bot    :5   Min.   :1   Min.   :6.000  
##                         osa_top    :5   1st Qu.:2   1st Qu.:6.200  
##                         townhouse  :5   Median :3   Median :6.600  
##                         wetland_bot:5   Mean   :3   Mean   :6.564  
##                         wetland_top:5   3rd Qu.:4   3rd Qu.:6.800  
##                                         Max.   :5   Max.   :7.200  
##  nitrate_lbacre phosphorus_lbacre
##  Min.   : 5.0   Min.   : 5.0     
##  1st Qu.: 5.0   1st Qu.: 5.0     
##  Median :10.0   Median :10.0     
##  Mean   : 9.4   Mean   :12.3     
##  3rd Qu.:10.0   3rd Qu.:17.5     
##  Max.   :20.0   Max.   :25.0

Aside: Differences between characters and factors

The data type factors ‘appear’ to be similar to characters, but are in fact coded numerically in R. Think of factors like categories. Here’s a quick example that demonstrates the difference in these two variable types that shows up when using summary.

## Make a copy of the soil data set
soil_data_new <- soil_data

## Create a new column that treats site as a character, rather than a factor
soil_data_new$site_char <- as.character(soil_data_new$site)

## Run summary command
summary(soil_data_new)
##       date    condition          site     replicate       pH       
##  3/29/18:25   wet:25    osa_bot    :5   Min.   :1   Min.   :6.000  
##                         osa_top    :5   1st Qu.:2   1st Qu.:6.200  
##                         townhouse  :5   Median :3   Median :6.600  
##                         wetland_bot:5   Mean   :3   Mean   :6.564  
##                         wetland_top:5   3rd Qu.:4   3rd Qu.:6.800  
##                                         Max.   :5   Max.   :7.200  
##  nitrate_lbacre phosphorus_lbacre  site_char        
##  Min.   : 5.0   Min.   : 5.0      Length:25         
##  1st Qu.: 5.0   1st Qu.: 5.0      Class :character  
##  Median :10.0   Median :10.0      Mode  :character  
##  Mean   : 9.4   Mean   :12.3                        
##  3rd Qu.:10.0   3rd Qu.:17.5                        
##  Max.   :20.0   Max.   :25.0
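To see what "coded numerically" means, the sketch below builds a factor from a small, made-up vector of site names and looks at its levels and integer codes. One caveat: from R 4.0.0 onward, read.csv reads text columns as characters by default, so to get factor columns like those in the summary output above you would need to pass stringsAsFactors = TRUE. The variable names and the inline CSV text are illustrative.

```r
# A small made-up character vector of site names (illustrative values).
site_char <- c("townhouse", "osa_top", "townhouse", "wetland_top")

# Convert it to a factor: R stores the distinct values as levels ...
site_fac <- factor(site_char)
levels(site_fac)
## [1] "osa_top"     "townhouse"   "wetland_top"

# ... and stores each entry as an integer code pointing to a level.
as.integer(site_fac)
## [1] 2 1 2 3

# Asking read.csv() explicitly for factors works in any version of R.
csv_text <- "site,pH\ntownhouse,6.5\nosa_top,7.0"
is.factor(read.csv(text = csv_text, stringsAsFactors = TRUE)$site)
## [1] TRUE
```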

Aside: A (very) brief introduction to navigating a data.frame

We will be very brief here. I recommend checking out this Data Carpentry lesson for more information.

  • Looking at specific data.frame elements. Use the row and column notation.

Here is the 5th row, 3rd column (Site). Note: We are using square brackets to index the data.frame and we always use row, column notation.

soil_data[5, 3]
## [1] townhouse
## Levels: osa_bot osa_top townhouse wetland_bot wetland_top

  • Looking at an entire column.

Here are two ways to get the pH column.

First, note: we leave the row part blank, but still add the comma.

soil_data[ ,5]
##  [1] 6.5 7.0 6.2 7.1 6.6 7.0 6.8 6.8 6.2 6.6 6.2 6.6 6.5 7.0 7.2 6.0 6.8
## [18] 6.0 6.2 6.2 6.8 6.3 6.4 6.4 6.7

Second, use only the variable (column) name. Note the use of the $ operator

soil_data$pH
##  [1] 6.5 7.0 6.2 7.1 6.6 7.0 6.8 6.8 6.2 6.6 6.2 6.6 6.5 7.0 7.2 6.0 6.8
## [18] 6.0 6.2 6.2 6.8 6.3 6.4 6.4 6.7

  • Looking at a specific column entry.

This is another way to look at the 5th entry in the pH column.

soil_data$pH[5]
## [1] 6.6

  • Looking at all entries for a given row.

Here’s all the entries for the 5th row. Note: here we leave the column part blank, but still add the comma.

soil_data[5, ]
##      date condition      site replicate  pH nitrate_lbacre
## 5 3/29/18       wet townhouse         5 6.6             10
##   phosphorus_lbacre
## 5                25

  • Looking at a set of rows and/or columns.

Here are all the entries in the 5th through 10th rows, 5th through 7th columns. Note: we use the : operator to look at a range of values.

soil_data[5:10, 5:7]
##     pH nitrate_lbacre phosphorus_lbacre
## 5  6.6             10                25
## 6  7.0             10                25
## 7  6.8              5                25
## 8  6.8              5                25
## 9  6.2             10                25
## 10 6.6              5                10

  • For data.frames, if you index with a single vector and no comma, R treats it as a column selection and returns those columns.

head(soil_data[5:7])
##    pH nitrate_lbacre phosphorus_lbacre
## 1 6.5             10              10.0
## 2 7.0             10              10.0
## 3 6.2             10              17.5
## 4 7.1             10              25.0
## 5 6.6             10              25.0
## 6 7.0             10              25.0
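Rows can also be selected by a condition rather than by position. The sketch below is a self-contained illustration: soil_small is a toy data frame rebuilt by hand from three of the columns in the first six rows shown by head(soil_data) above, and the names soil_small and high_pH are illustrative.

```r
# A toy data frame built from the first six rows shown by head(soil_data).
soil_small <- data.frame(
  site = c(rep("townhouse", 5), "osa_top"),
  pH = c(6.5, 7.0, 6.2, 7.1, 6.6, 7.0),
  phosphorus_lbacre = c(10, 10, 17.5, 25, 25, 25)
)

# Keep only the rows where pH is at least 7 (all columns are returned).
high_pH <- soil_small[soil_small$pH >= 7, ]
nrow(high_pH)
## [1] 3

# Combine a row condition with a column name.
soil_small[soil_small$site == "osa_top", "pH"]
## [1] 7
```

The condition goes in the row position, before the comma, because it produces one TRUE or FALSE value per row.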

Challenge

What am I going to get if I execute the command below?

head(soil_data[c("site","phosphorus_lbacre")])

Basic data visualization

There are many data visualization tools in R, and a very rich ecosystem of add-on packages that make it easy to create publication ready figures. Here we will learn a few basic visualization tools.

Histograms

Let’s make a histogram of the pH data.

hist(soil_data$pH, breaks = 10)

Next let’s make a scatter plot, showing pH versus phosphorus.

plot(x = soil_data$pH, y = soil_data$phosphorus_lbacre)

Lastly, let’s make a box plot. Note that the syntax of our function call is a bit different than above. Here we are using the ~ to say “make phosphorus a function of site”.

boxplot(soil_data$phosphorus_lbacre ~ soil_data$site)
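The default plots use the R expressions as axis labels, which can be hard to read. The sketch below shows one way to add a title and axis labels, and an alternative way to write the box plot formula using the data argument. It uses a toy data frame (soil_small, an illustrative name) rebuilt from the first six rows shown by head(soil_data); the "lb/acre" unit in the labels is an assumption inferred from the column name.

```r
# A toy data frame built from the first six rows shown by head(soil_data).
soil_small <- data.frame(
  site = factor(c(rep("townhouse", 5), "osa_top")),
  pH = c(6.5, 7.0, 6.2, 7.1, 6.6, 7.0),
  phosphorus_lbacre = c(10, 10, 17.5, 25, 25, 25)
)

# Scatter plot with a title and readable axis labels.
plot(x = soil_small$pH, y = soil_small$phosphorus_lbacre,
     main = "Phosphorus versus pH",
     xlab = "pH", ylab = "Phosphorus (lb/acre)")

# The formula interface also accepts a data argument, which avoids
# repeating the data frame name on both sides of the ~.
boxplot(phosphorus_lbacre ~ site, data = soil_small,
        xlab = "Site", ylab = "Phosphorus (lb/acre)")
```

The same main, xlab, and ylab arguments also work with hist.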


Challenge

Using R, remake one of the plots your group worked on during the Speed Data Science activity.


Instructor’s Notes

This lesson uses live coding. While the learners should be made aware of the existence of this document, both the instructor and the learners should work on all coding as if this document were being created from scratch. Also, the instructor and students should both be working in a new *.R script file.

Learning Outcomes

At the end of this session you should be able to:

  • Perform basic arithmetic using R/RStudio
  • Identify the major components of the RStudio interface
  • Create an R script file and use it to document a sequence of R functions
  • Read in a data set and perform basic data summary procedures in R

Assessment

Assessment for live coding is based on the learners’ success in carrying out the Challenges. This is primarily a formative assessment. A summative assessment could be incorporated as an in-class analysis lab or as part of a homework assignment.

Learning Activities

Live-coding using a data set relevant to the course content. In this case, these data could be associated with a Plant Ecology course or an undergraduate research course.