At the end of this session you should be able to:
For this session, you can either work in RStudio on your own computer (if you have RStudio installed) or you can work in an RStudio Cloud session. To start an RStudio Cloud session go to https://rstudio.cloud/ and log in using your GitHub account.
In this session, we will be working primarily in RStudio. So what is the difference between R and RStudio? R is a programming language (specifically a statistical analysis programming language), while RStudio is an Integrated Development Environment, or IDE. You can think of it as an interface to R that aids the analyst. RStudio offers a number of features, mostly related to visual presentation of information, that make writing and working with R code easier.
There are four panels in the RStudio interface (though you may only have three open when you first start it), and each displays valuable information.
Good file management is important in data science. We’ll start working on this now.
Create the following directories / folders:
Wallace-Invasives directory / folder
data directory / folder within Wallace-Invasives
scripts directory / folder within Wallace-Invasives
Next, set your working directory to the Wallace-Invasives directory. There are two ways to do this:
Point-and-click method - Use ‘Session’ > ‘Set Working Directory’ > ‘Choose Directory’.
Using the R Console:
setwd("/Users/maiellolammens/Wallace-Invasives/")
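The same folder layout can also be created from the R Console. The following is a minimal sketch, not part of the original lesson: `project_dir` is a hypothetical name, and `tempdir()` is used only so the example can be tried safely; replace it with the location where you actually keep your projects.

```r
# Build the path to a practice copy of the project inside R's temporary folder.
# (tempdir() is a stand-in; use your own location for real work.)
project_dir <- file.path(tempdir(), "Wallace-Invasives")

# Create the project directory and its two sub-directories.
# recursive = TRUE also creates any missing parent directories;
# showWarnings = FALSE keeps R quiet if a directory already exists.
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "scripts"), recursive = TRUE, showWarnings = FALSE)

# Check that both sub-directories now exist
dir.exists(file.path(project_dir, c("data", "scripts")))
```

Using `file.path()` rather than pasting strings together keeps the path separators correct on Windows, macOS, and Linux.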
Let’s make a new R Project associated with your Wallace-Invasives directory. Why we are doing this will become clearer later, when we get to the Git lesson. To make a new project, go to the upper right-hand side of the RStudio interface, where it says Project: (None). Click the little downward arrow, select “New Project”, then select “Existing Directory” from the window that pops up. Use the graphical user interface (GUI) to navigate to the Wallace-Invasives directory, then select “Create Project”.
help.search
help.search("bar plot")
Use the help.search function to search for something in statistics that you think should be in R. Did you find anything?
?barplot
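A few other base R helpers are useful when looking for documentation. This is a short sketch of some options rather than a complete list; `plot_functions` is a hypothetical variable name.

```r
# Open the help page for a function whose name you know (equivalent to ?barplot)
help("barplot")

# List the currently available objects whose names contain "plot"
plot_functions <- apropos("plot")
head(plot_functions)

# Run the examples from a function's help page to see it in action
example("mean")
```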
We can use R just like any other calculator.
3 + 5
## [1] 8
R follows the standard order of operations (Please Excuse My Dear Aunt Sally).
(3 * 5) + 7
## [1] 22
3 * 5 + 7
## [1] 22
Write an example where adding parentheses matters.
R has a large number of built-in functions, and many more are available through add-on packages.
sqrt(4)
## [1] 2
abs(-5)
## [1] 5
sqrt(-5)
## Warning in sqrt(-5): NaNs produced
## [1] NaN
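The NaN (“Not a Number”) result above can be detected and, if needed, avoided. The following is a small sketch; the names `neg_root` and `complex_root` are hypothetical.

```r
# sqrt() of a negative number returns NaN and raises a warning;
# suppressWarnings() hides the warning for this demonstration only
neg_root <- suppressWarnings(sqrt(-5))

# is.nan() tests whether a value is NaN
is.nan(neg_root)  # TRUE

# Converting the input to a complex number returns the complex root instead
complex_root <- sqrt(as.complex(-5))
complex_root
```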
Use a script file for your work. It’s easier to go back to and easy to document.
Important: within an R file, you can use the # sign to add comments. Anything written after the # is not interpreted when you run the code.
Create a new R script file in your scripts directory.
# What working directory am I in?
getwd()
## [1] "/Users/maiellolammens/Google Drive/Professional/Workshops-and-Short-Courses-Taught/NAISMA-2019/docs"
# Move to a different directory?
setwd(".")
fil, what do you find?
file, and describe what you think it does.
There are several basic types of data structures in R.
These functions will tell you what kind of variable you are dealing with, as well as some additional information which may be useful as you advance in your use of R.
str()
class()
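As a quick illustration, here is what these two functions report for a few simple values. This is a minimal sketch using literal values only.

```r
# class() reports the type of a value
class(8)        # "numeric"
class("eight")  # "character"
class(TRUE)     # "logical"

# str() gives a compact summary of the structure: the type, then the contents
str(8)
```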
Let’s define a variable.
my_var <- 8
And another
my_var2 <- 10
Work with the variables
my_var + my_var2
## [1] 18
Make a new variable
my_var_tot <- my_var + my_var2
Change the value of my_var2
my_var2 <- 3
What is the value of my_var_tot now?
Let’s combine multiple values into a vector of length greater than 1.
# Vector of variables
my_vect <- c(my_var, my_var2)
# Numeric vector
v1 <- c(10, 2, 8, 7, 11, 15)
# Char vector
pets <- c("cat", "dog", "rabbit", "pig")
Making a vector of numbers in sequence
v2 <- 1:10
v3 <- seq(from = 1, to = 10)
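Besides `from` and `to`, `seq` accepts arguments that control the spacing of the values. The sketch below uses two of them, `by` and `length.out`; the variable names are hypothetical.

```r
# by sets the step size between consecutive values
by_quarters <- seq(from = 0, to = 1, by = 0.25)
by_quarters  # 0.00 0.25 0.50 0.75 1.00

# length.out sets how many evenly spaced values to return
five_values <- seq(from = 0, to = 1, length.out = 5)
five_values  # the same five values as above

# A negative step counts downwards
countdown <- seq(from = 10, to = 1, by = -3)
countdown    # 10 7 4 1
```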
Look up the help for the seq function, and use this to make a vector from 1 to 100, by steps of 5.
Then find a use for the length.out argument.
You can get specific elements from vectors and other data structures using square brackets, [].
pets <- c("cat", "dog", "rabbit", "pig", "snake")
pets[1]
## [1] "cat"
pets[3:4]
## [1] "rabbit" "pig"
pets[c(1,4)]
## [1] "cat" "pig"
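Square brackets also accept negative numbers and logical conditions. The following sketch shows a few further indexing patterns on the same `pets` vector.

```r
pets <- c("cat", "dog", "rabbit", "pig", "snake")

# A negative index drops that element and returns the rest
pets[-1]                 # "dog" "rabbit" "pig" "snake"

# A logical condition keeps only the elements where it is TRUE
pets[pets != "snake"]    # "cat" "dog" "rabbit" "pig"

# length() gives the number of elements, so this returns the last one
pets[length(pets)]       # "snake"
```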
Review - Why might we want 2D data?
Let’s make a matrix
With the people next to you, break down this function, and describe each argument. What is the final product?
my_mat <- matrix(data = runif(50), nrow = 10, byrow = TRUE)
What does it mean to fill byrow?
matrix(data = 1:9, nrow = 3, byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Versus
matrix(data = 1:9, nrow = 3, byrow = FALSE)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
What is the default value for byrow?
Indexing happens by row, column notation.
my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)
my_mat[1,1]
## [1] 1
my_mat[1,2]
## [1] 2
my_mat[2,1]
## [1] 6
my_mat[1:4, 1:3]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 6 7 8
## [3,] 11 12 13
## [4,] 16 17 18
my_mat[c(1,3,5), ]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 11 12 13 14 15
## [3,] 21 22 23 24 25
my_mat[ ,c(1,3,5)]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 6 8 10
## [3,] 11 13 15
## [4,] 16 18 20
## [5,] 21 23 25
## [6,] 26 28 30
## [7,] 31 33 35
## [8,] 36 38 40
## [9,] 41 43 45
## [10,] 46 48 50
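Two details of matrix indexing are worth knowing early: selecting a single row or column returns a plain vector, and a logical condition can be used as an index. A short sketch follows; `first_row` and `first_row_mat` are hypothetical names.

```r
my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)

# dim() returns the number of rows and columns
dim(my_mat)               # 10 5

# Selecting one row drops the matrix structure and returns a vector
first_row <- my_mat[1, ]
is.matrix(first_row)      # FALSE

# drop = FALSE keeps the result as a 1 x 5 matrix
first_row_mat <- my_mat[1, , drop = FALSE]
dim(first_row_mat)        # 1 5

# A logical condition selects every element that satisfies it
my_mat[my_mat > 47]       # 48 49 50
```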
Make a “random” matrix (the values are reproducible in this case because the set.seed function fixes the starting point of the random number generator)
set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)
Calculate the mean of all of the data
mean(mat1)
## [1] 0.5325929
Calculate the standard deviation of all of the data
sd(mat1)
## [1] 0.272239
Calculate row means and column means
rowMeans(mat1)
## [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
## [8] 0.5411271 0.6667390 0.5429926
colMeans(mat1)
## [1] 0.5949241 0.4500704 0.5808290 0.5491985 0.4879423
Introduce the apply function
apply(mat1, MARGIN = 1, mean)
## [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
## [8] 0.5411271 0.6667390 0.5429926
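In `apply`, `MARGIN = 1` works across rows and `MARGIN = 2` works down columns, and any function that takes a vector can be supplied. The sketch below rebuilds `mat1` so that it runs on its own; `col_means` and `row_sds` are hypothetical names.

```r
set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

# MARGIN = 2 applies the function to each column
col_means <- apply(mat1, MARGIN = 2, mean)

# The result matches the dedicated colMeans() function
all.equal(col_means, colMeans(mat1))  # TRUE

# apply() is more general: here, the standard deviation of each row
row_sds <- apply(mat1, MARGIN = 1, sd)
length(row_sds)                       # 10, one value per row
```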
One of the most basic things you will need to do in R is read in your own data set. You can read in Excel files, simple text files, and even files from Google Sheets. But the easiest type of file to read in is a comma separated values (CSV) file. You can save an Excel workbook (or Numbers workbook or Google Sheet) as a CSV file by using the “Save as …” menu item.
Let’s read in the soil data from our Speed Data Science activity.
soil_data <- read.csv("https://mlammens.github.io/Wallace-NAISMA-2019/docs/soil_nutrient_data.csv")
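`read.csv` reads local files in the same way as URLs. The following is a minimal sketch, assuming a small practice file written to a temporary location; `practice_file` and `practice_data` are hypothetical names, and the two data rows are copied from the first two rows of the soil data set.

```r
# Write a tiny CSV file to R's temporary folder so there is something to read
practice_file <- file.path(tempdir(), "soil_practice.csv")
writeLines(c("site,pH,phosphorus_lbacre",
             "townhouse,6.5,10",
             "townhouse,7.0,10"),
           practice_file)

# Read the file; the first line is used for the column names
practice_data <- read.csv(practice_file)

# Check the structure of what was read in
str(practice_data)
```

For a file stored in the project's data directory, the path passed to `read.csv` would be relative to the working directory, for example a path beginning with `data/`.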
Let’s have a brief look at this data set.
head(soil_data)
## date condition site replicate pH nitrate_lbacre
## 1 3/29/18 wet townhouse 1 6.5 10
## 2 3/29/18 wet townhouse 2 7.0 10
## 3 3/29/18 wet townhouse 3 6.2 10
## 4 3/29/18 wet townhouse 4 7.1 10
## 5 3/29/18 wet townhouse 5 6.6 10
## 6 3/29/18 wet osa_top 1 7.0 10
## phosphorus_lbacre
## 1 10.0
## 2 10.0
## 3 17.5
## 4 25.0
## 5 25.0
## 6 25.0
tail(soil_data)
## date condition site replicate pH nitrate_lbacre
## 20 3/29/18 wet wetland_top 5 6.2 10
## 21 3/29/18 wet wetland_bot 1 6.8 15
## 22 3/29/18 wet wetland_bot 2 6.3 10
## 23 3/29/18 wet wetland_bot 3 6.4 5
## 24 3/29/18 wet wetland_bot 4 6.4 10
## 25 3/29/18 wet wetland_bot 5 6.7 10
## phosphorus_lbacre
## 20 10
## 21 10
## 22 10
## 23 10
## 24 10
## 25 5
As we saw in our previous activity, these data include nutrient measurements at multiple different sites on a suburban campus. The data set also includes information about the date samples were collected and the conditions of the site during collection.
summary function
Let’s begin by using the summary function to examine this data set. summary returns many of the standard statistics. When doing data exploration, a few things you want to look at are:
summary(soil_data)
##     date      condition      site        replicate        pH
##  3/29/18:25   wet:25    osa_bot    :5   Min.   :1   Min.   :6.000
##                         osa_top    :5   1st Qu.:2   1st Qu.:6.200
##                         townhouse  :5   Median :3   Median :6.600
##                         wetland_bot:5   Mean   :3   Mean   :6.564
##                         wetland_top:5   3rd Qu.:4   3rd Qu.:6.800
##                                         Max.   :5   Max.   :7.200
##  nitrate_lbacre   phosphorus_lbacre
##  Min.   : 5.0     Min.   : 5.0
##  1st Qu.: 5.0     1st Qu.: 5.0
##  Median :10.0     Median :10.0
##  Mean   : 9.4     Mean   :12.3
##  3rd Qu.:10.0     3rd Qu.:17.5
##  Max.   :20.0     Max.   :25.0
Factors ‘appear’ to be similar to characters, but are in fact coded numerically in R. Think of factors as categories. Here’s a quick example that demonstrates how these two variable types differ when using summary.
## Make a new soil dataset
soil_data_new <- soil_data
## Create a new column that treats site as a character, rather than a factor
soil_data_new$site_char <- as.character(soil_data_new$site)
## Run summary command
summary(soil_data_new)
##     date      condition      site        replicate        pH
##  3/29/18:25   wet:25    osa_bot    :5   Min.   :1   Min.   :6.000
##                         osa_top    :5   1st Qu.:2   1st Qu.:6.200
##                         townhouse  :5   Median :3   Median :6.600
##                         wetland_bot:5   Mean   :3   Mean   :6.564
##                         wetland_top:5   3rd Qu.:4   3rd Qu.:6.800
##                                         Max.   :5   Max.   :7.200
##  nitrate_lbacre   phosphorus_lbacre   site_char
##  Min.   : 5.0     Min.   : 5.0        Length:25
##  1st Qu.: 5.0     1st Qu.: 5.0        Class :character
##  Median :10.0     Median :10.0        Mode  :character
##  Mean   : 9.4     Mean   :12.3
##  3rd Qu.:10.0     3rd Qu.:17.5
##  Max.   :20.0     Max.   :25.0
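The numeric coding behind a factor can be inspected directly. Note that the output shown in this lesson has text columns stored as factors, which was the `read.csv` default before R 4.0.0; from R 4.0.0 onward, text columns are read as characters unless `stringsAsFactors = TRUE` is passed to `read.csv`. The sketch below uses a short hypothetical vector of site labels (the labels themselves are taken from the soil data).

```r
# A character vector of site labels
site_char <- c("townhouse", "osa_top", "wetland_bot", "townhouse")

# Convert the character vector to a factor
site_fact <- factor(site_char)

# The levels are the distinct categories, sorted alphabetically
levels(site_fact)      # "osa_top" "townhouse" "wetland_bot"

# Each element is stored as an integer code pointing to a level
as.integer(site_fact)  # 2 1 3 2

# summary() counts observations per level for a factor,
# but reports only length, class, and mode for a character vector
summary(site_fact)
summary(site_char)
```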
What am I going to get if I execute the command below?
head(soil_data[c("site","phosphorus_lbacre")])
There are many data visualization tools in R, and a very rich ecosystem of add-on packages that make it easy to create publication ready figures. Here we will learn a few basic visualization tools.
Let’s make a histogram of some of the nutrient data
hist(soil_data$pH, breaks = 10)
Next let’s make a scatter plot, showing pH versus phosphorus.
plot(x = soil_data$pH, y = soil_data$phosphorus_lbacre)
Lastly, let’s make a box plot. Note that the syntax of our function call is a bit different than above. Here we are using the ~ to say “make phosphorus a function of site”.
boxplot(soil_data$phosphorus_lbacre ~ soil_data$site)
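The formula interface also accepts a `data` argument, which avoids repeating the data frame name, and the standard `xlab`, `ylab`, and `main` arguments add labels. The following is a sketch under some assumptions: `soil_subset` is a hypothetical data frame built from ten rows of the soil data shown earlier (rows 1 to 5 and 21 to 25), and the axis labels are suggestions based on the column names.

```r
# A small subset of the soil data, typed in so the example runs on its own
soil_subset <- data.frame(
  site = factor(rep(c("townhouse", "wetland_bot"), each = 5)),
  phosphorus_lbacre = c(10, 10, 17.5, 25, 25,  # rows 1-5 of soil_data
                        10, 10, 10, 10, 5)     # rows 21-25 of soil_data
)

# Column names in the formula are looked up in the data frame given to data
bp <- boxplot(phosphorus_lbacre ~ site, data = soil_subset,
              xlab = "Site",
              ylab = "Phosphorus (lb/acre)",
              main = "Soil phosphorus by site")

# boxplot() also returns the statistics it plotted, e.g. the group sizes
bp$n
```

The same `xlab`, `ylab`, and `main` arguments work with `hist` and `plot`.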
Using R, remake one of the plots your group worked on during the Speed Data Science activity.
This lesson utilizes live coding. While the learners should be made aware of the existence of this document, both the instructor and the learner should work on all coding as if this document was being created from scratch. Also, the instructor and students should both be working in a new *.R script file.
Assessment for live coding is based on the learners’ performance in carrying out the Challenges. This is primarily a formative assessment. A summative assessment could be incorporated as an in-class analysis lab or as part of a homework assignment.
Live-coding using a data set relevant to the course content. In this case, these data could be associated with a Plant Ecology course or an undergraduate research course.