Before starting this section, it’s a good idea to restart RStudio and clear your environment if there’s anything in there.
There are several basic types of data structures in R.
The following functions tell you what kind of variable you are dealing with, along with additional information that may be useful as you advance in your use of R.
str()
class()
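As a minimal sketch of what these two functions report, the example below applies them to a few throwaway values. The names x_num, x_chr and x_lgl are made up for illustration.

```r
# Hypothetical example objects, used only for illustration
x_num <- 8
x_chr <- "cat"
x_lgl <- TRUE

# class() returns the class of an object as a character vector
class(x_num)   # "numeric"
class(x_chr)   # "character"
class(x_lgl)   # "logical"

# str() prints a compact description of an object's structure
str(x_num)
str(x_chr)
```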
Define a variable
my_var <- 8
And another
my_var2 <- 10
Work with vars
my_var + my_var2
## [1] 18
Make a new variable
my_var_tot <- my_var + my_var2
Change the value of my_var2
my_var2 <- 3
What is the value of my_var_tot now?
Define a variable
my_var <- 8
And another
my_var2 <- 10
Combining values into a vector
# Vector of variables
my_vect <- c(my_var, my_var2)
Here are some other examples of vectors
# Numeric vector
v1 <- c(10, 2, 8, 7, 11, 15)
# Character vector
pets <- c("cat", "dog", "rabbit", "pig")
Making a vector of numbers in sequence
v2 <- 1:10
v3 <- seq(from = 1, to = 10)
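Beyond from and to, seq can also control the spacing of the sequence. The sketch below uses the by and length.out arguments on an arbitrary range from 0 to 1. The names s_by and s_len are made up for illustration.

```r
# Step through a range in fixed increments with `by`
s_by <- seq(from = 0, to = 1, by = 0.25)
s_by

# Ask for a fixed number of evenly spaced values with `length.out`
s_len <- seq(from = 0, to = 1, length.out = 5)
s_len

# Both calls describe the same five values here
all.equal(s_by, s_len)
```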
Look at the seq function, and use it to make a vector from 1 to 100, by steps of 5.
Look at the length.out argument in the seq function.
You can get specific elements from vectors and other data structures using square brackets, [].
Using the square bracket notation is sometimes referred to as indexing, or index notation.
pets <- c("cat", "dog", "rabbit", "pig", "snake")
pets[1]
## [1] "cat"
pets[3:4]
## [1] "rabbit" "pig"
pets[c(1,4)]
## [1] "cat" "pig"
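Square brackets accept more than positive positions. As a small sketch that goes slightly beyond the examples above, a negative index drops an element and a logical condition keeps only the elements for which it is TRUE. The names no_first and not_pig are made up for illustration.

```r
pets <- c("cat", "dog", "rabbit", "pig", "snake")

# A negative index drops that element instead of selecting it
no_first <- pets[-1]
no_first
# "dog" "rabbit" "pig" "snake"

# A logical vector keeps the elements where the condition is TRUE
not_pig <- pets[pets != "pig"]
not_pig
# "cat" "dog" "rabbit" "snake"
```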
Review - Why might we want 2D data?
Let’s make a matrix
With a partner, break down this function, and describe each argument. What is the final product?
my_mat <- matrix(data = runif(50), nrow = 10, byrow = TRUE)
What does it mean to fill byrow?
matrix(data = 1:9, nrow = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
Versus
matrix(data = 1:9, nrow = 3, byrow = FALSE)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
What is the default value for byrow?
Matrix indexing uses [row, column] notation.
my_mat <- matrix(data = 1:50, nrow = 10, byrow = TRUE)
my_mat[1,1]
## [1] 1
my_mat[1,2]
## [1] 2
my_mat[2,1]
## [1] 6
my_mat[1:4, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    6    7    8
## [3,]   11   12   13
## [4,]   16   17   18
my_mat[c(1,3,5), ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]   11   12   13   14   15
## [3,]   21   22   23   24   25
my_mat[ ,c(1,3,5)]
##       [,1] [,2] [,3]
##  [1,]    1    3    5
##  [2,]    6    8   10
##  [3,]   11   13   15
##  [4,]   16   18   20
##  [5,]   21   23   25
##  [6,]   26   28   30
##  [7,]   31   33   35
##  [8,]   36   38   40
##  [9,]   41   43   45
## [10,]   46   48   50
Make a “random” matrix (that isn’t random in this case because of the set.seed function)
set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)
Calculate the mean of all of the data
mean(mat1)
## [1] 0.5325929
Calculate the standard deviation of all of the data
sd(mat1)
## [1] 0.272239
Calculate row means and column means
rowMeans(mat1)
## [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
## [8] 0.5411271 0.6667390 0.5429926
colMeans(mat1)
## [1] 0.5949241 0.4500704 0.5808290 0.5491985 0.4879423
Introduce the apply function
apply(mat1, MARGIN = 1, mean)
## [1] 0.4640751 0.6389526 0.4446999 0.6729408 0.4382595 0.3983864 0.5177556
## [8] 0.5411271 0.6667390 0.5429926
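In apply, MARGIN = 1 runs the function over each row and MARGIN = 2 runs it over each column. The sketch below rebuilds the same matrix and checks the column case against colMeans. It also passes a different function, sd, to show that apply is not limited to means. The names col_means and row_sds are made up for illustration.

```r
set.seed(1)
mat1 <- matrix(data = runif(50), nrow = 10, byrow = TRUE)

# MARGIN = 2 applies the function to each column
col_means <- apply(mat1, MARGIN = 2, mean)

# This should agree with the dedicated colMeans function
all.equal(col_means, colMeans(mat1))

# apply accepts any function that summarises a vector,
# for example the standard deviation of each row
row_sds <- apply(mat1, MARGIN = 1, sd)
length(row_sds)   # one value per row
```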
We’re going to work with a dataset that comes built in to R, commonly called the iris dataset. It is also sometimes called Fisher’s Iris dataset (but should more appropriately be called Anderson’s Iris dataset). Because it comes pre-packaged with R, we can load it into our environment using the data function.
data(iris)
Let’s have a look at this dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
The dataset contains measurements of four characteristics of three different species of Iris (plants!).
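Before summarising the data, it can help to check its size and the type of each column. The following is a minimal sketch using base R functions on the built-in iris data.

```r
data(iris)

# Number of rows and columns
dim(iris)

# Name and type of each column, with the first few values
str(iris)

# Pull out a single column with $, or use [row, column] indexing as with a matrix
head(iris$Sepal.Length)
iris[1:3, c("Sepal.Length", "Species")]

# The three species are stored as the levels of a factor
levels(iris$Species)
```

Here dim should report 150 rows and 5 columns, matching the row numbers seen in the tail output above.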
summary function
Let’s begin by using the summary function to examine this data set. summary returns many of the standard statistics. When doing data exploration, a few things you want to look at are:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
Factors ‘appear’ to be similar to characters, but are in fact coded numerically in R. Think of factors like categories. Here’s a quick example that demonstrates the difference in these two variable types that shows up when using summary.
# Make a new iris dataset
iris_new <- iris
# Create a new column that treats species as a character, rather than a factor
iris_new$species_char <- as.character(iris_new$Species)
# Run summary command
summary(iris_new)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species   species_char
##  setosa    :50   Length:150
##  versicolor:50   Class :character
##  virginica :50   Mode  :character
##
##
##
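To see what “coded numerically” means, the sketch below builds a small factor by hand and looks at its levels and the integer codes underneath. The vector size_chr and its values are made up for illustration.

```r
# A small character vector, made up for illustration
size_chr <- c("small", "large", "small", "medium")

# Converting to a factor stores each distinct value once as a "level"
size_fct <- factor(size_chr)
levels(size_fct)       # "large" "medium" "small" (sorted alphabetically by default)

# Underneath, each element is an integer pointing at one of the levels
as.integer(size_fct)   # 3 1 3 2

# table() counts the observations in each category
table(size_fct)
```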
One of the most basic things you will need to do in R is read in your own data set. You can read in Excel files, simple text files, and even files from Google Sheets. But the easiest type of file to read in is a comma separated values (CSV) file. You can save an Excel workbook (or Numbers workbook or Google Sheet) as a CSV file by using the “Save as …” menu item.
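As a rough sketch of reading a CSV file from your own computer, the example below first writes a tiny made-up table to a temporary file and then reads it back. The names tmp_file, example_df and local_data, and the values in the table, are invented for illustration. With real data you would pass the path to your own file instead.

```r
# Write a small, made-up table to a temporary CSV file
# so the example is self-contained
tmp_file <- tempfile(fileext = ".csv")
example_df <- data.frame(site = c("a", "b", "c"), pH = c(6.5, 7.0, 6.2))
write.csv(example_df, file = tmp_file, row.names = FALSE)

# read.csv takes the path to the file and returns a data.frame
local_data <- read.csv(tmp_file)
local_data
class(local_data)
```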
Let’s read in the soil data from past activities using the function read.csv. Note that in this example, we are reading this file using a URL that points to it, but you can just as easily read in a file on your local system (in fact, this is much more common!).
soil_data <- read.csv("https://mlammens.github.io/ESA-2021-BEDE-Network/docs/soil_nutrient_data.csv")
Let’s have a brief look at these data.
head(soil_data)
##      date condition      site replicate  pH nitrate_lbacre phosphorus_lbacre
## 1 3/29/18       wet townhouse         1 6.5             10              10.0
## 2 3/29/18       wet townhouse         2 7.0             10              10.0
## 3 3/29/18       wet townhouse         3 6.2             10              17.5
## 4 3/29/18       wet townhouse         4 7.1             10              25.0
## 5 3/29/18       wet townhouse         5 6.6             10              25.0
## 6 3/29/18       wet   osa_top         1 7.0             10              25.0
tail(soil_data)
##       date condition        site replicate  pH nitrate_lbacre phosphorus_lbacre
## 20 3/29/18       wet wetland_top         5 6.2             10                10
## 21 3/29/18       wet wetland_bot         1 6.8             15                10
## 22 3/29/18       wet wetland_bot         2 6.3             10                10
## 23 3/29/18       wet wetland_bot         3 6.4              5                10
## 24 3/29/18       wet wetland_bot         4 6.4             10                10
## 25 3/29/18       wet wetland_bot         5 6.7             10                 5
These data include nutrient measurements at multiple sites on a suburban college campus. The data set also records where the samples were collected and the conditions of the site during collection.
Let’s begin by using the summary function to examine this data set. As above, let’s look for the following:
summary(soil_data)
##      date            condition             site             replicate
##  Length:25          Length:25          Length:25          Min.   :1
##  Class :character   Class :character   Class :character   1st Qu.:2
##  Mode  :character   Mode  :character   Mode  :character   Median :3
##                                                           Mean   :3
##                                                           3rd Qu.:4
##                                                           Max.   :5
##        pH        nitrate_lbacre phosphorus_lbacre
##  Min.   :6.000   Min.   : 5.0   Min.   : 5.0
##  1st Qu.:6.200   1st Qu.: 5.0   1st Qu.: 5.0
##  Median :6.600   Median :10.0   Median :10.0
##  Mean   :6.564   Mean   : 9.4   Mean   :12.3
##  3rd Qu.:6.800   3rd Qu.:10.0   3rd Qu.:17.5
##  Max.   :7.200   Max.   :20.0   Max.   :25.0
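Because site was read in as a character column, summary only reports its length and class. One possible next step, sketched below, is to convert it to a factor so that summary counts the rows in each category. To keep the example runnable without downloading the file, it uses a stand-in data frame (the name soil_head is made up) holding only the site and pH values from the six rows printed by head(soil_data) above.

```r
# Stand-in for the first six rows of soil_data, typed in from the head() output
soil_head <- data.frame(
  site = c("townhouse", "townhouse", "townhouse", "townhouse", "townhouse", "osa_top"),
  pH   = c(6.5, 7.0, 6.2, 7.1, 6.6, 7.0)
)

# As a character column, summary() only reports length and class
summary(soil_head$site)

# As a factor, summary() counts the rows in each category
soil_head$site <- as.factor(soil_head$site)
summary(soil_head$site)

# Mean pH within each site
ph_by_site <- tapply(soil_head$pH, soil_head$site, mean)
ph_by_site
```

With the full data set, the same conversion could be applied to soil_data$site, or requested when reading the file by passing stringsAsFactors = TRUE to read.csv.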