Overview

In this activity, you will be introduced to some basic data processing and exploration steps using data from the National Ecological Observatory Network (NEON). You can learn more about NEON at this website - https://www.neonscience.org/. This activity is designed so that you can complete it using a mobile device, such as your phone, in conjunction with RStudio Cloud, https://rstudio.cloud/. Throughout the first part of the activity, we will work with the NEON Plant Presence and Percent Cover data set. More details on these data can be found here - https://data.neonscience.org/data-products/DP1.10058.001.

Learning Objectives

By completing this activity, you should be able to:

Quantitative / Data Skills

  • explain the importance of data management in the scientific process
  • modify a data set using a data.frame object in R
  • modify dplyr code to summerise a data set
  • calculate plot-level species richness values

Concept learning objectives

  • interpret data visualizations, such as histograms
  • develop hypotheses related to differences in plant species richness among NEON sites

Preliminaries

Starting your RStudio session

In order to carry out the analyses in this activity, you will first need to initate a new R session. If you are on a desktop computer with RStudio installed, start a new RStudio session. (NOTE: In this activity, I am not being careful in distiguishing R and RStudio. Strictly speaking, this is poor practice. R is the computer programming language and RStudio is an Integrated Development Environment that makes using R easier. You need to have both programs installed on your desktop.) If you do not have RStudio installed, or you are working on a mobile device, you can sign-up for a free RStudio Cloud account, where you can run an RStudio session through a browser. Go to https://rstudio.cloud/ to begin this process.

Getting our Rmd file

Next, let’s get the Rmd file that will allow us to run all of the code for this activity. This Rmd file is also the file that translates into this webpage! (Yes, that’s meta.) We will use the download.file function to get this Rmd file. You will need to copy and paste this line in your RStudio session.

download.file("https://github.com/mlammens/NEONonMobile/raw/master/working-with-neon.Rmd",
              destfile = "./temp.Rmd")

Install and load a few packages into your R environment

If you are working on a mobile device …

install.packages("dplyr")
install.packages("ggplot2")
install.packages("neonUtilities")

Load packages

library(neonUtilities)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Set global option to NOT convert all character variables to factors
options(stringsAsFactors=F)

Working with plant presence and percent cover data

We will begin by working with NEONs plant presence and percent cover data set. More details on these data can be found here - https://data.neonscience.org/data-products/DP1.10058.001.

Getting data using loadByProduct

First, we will use the loadByProduct function in neonUtilities.

NOTE: Here we are using the check.size = FALSE argument. We are doing this to prevent the function from sending a query to the console, and because we know these data are relatively small in size (approx. 2.5 MB). However, if working with potentially larger data sets, then you should set check.size = TRUE.

plants <- loadByProduct(dpID = "DP1.10058.001", site = c("HARV", "BART"), check.size = FALSE)
## Downloading files totaling approximately 2.490959 MB
## Downloading 22 files
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |======================================================================| 100%
## 
## Unpacking zip files using 1 cores.
## Stacking operation across a single core.
## Stacking table div_10m2Data100m2Data
## Stacking table div_1m2Data
## Copied the most recent publication of validation file to /stackedFiles
## Copied the most recent publication of variable definition file to /stackedFiles
## Finished: Stacked 2 data tables and 2 metadata tables!
## Stacking took 1.294869 secs

Alternative approach - working with zipped files after manual download from data portal

Stack data

stackByTable("NEON_presence-cover-plant.zip")

Read data

plant_data_1m <- read.csv("NEON_presence-cover-plant/stackedFiles/div_1m2Data.csv")

Exploring the data

Let’s examine the data we just recieved. They should be stored as a list object. You can find out more information about lists here - https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html.

For our immediate perposes, what we need to know is that we can access any element of a list with the $ operator. But first, let’s get the names of these different objects.

names(plants)
## [1] "div_10m2Data100m2Data" "div_1m2Data"           "readme_10058"         
## [4] "validation_10058"      "variables_10058"

The readme_10058 element looks promising.

plants$readme_10058
## # A tibble: 2,063 x 1
##    X1                                                                           
##    <chr>                                                                        
##  1 ###################################                                          
##  2 ########### Disclaimer ############                                          
##  3 This is the most recent readme publication based on all site-date combinatio…
##  4 Information specific to the query, including sites and dates, has been remov…
##  5 All files used during stacking are listed at the bottom of this document, wh…
##  6 ##################################                                           
##  7 This data package been produced by and downloaded from the National Ecologic…
##  8 DATA PRODUCT INFORMATION                                                     
##  9 ------------------------                                                     
## 10 ID: NEON.DOM.SITE.DP1.10058.001                                              
## # … with 2,053 more rows

OK. If we navigated that readme just a bit, we would have learned that there are at least two data sets with the actual vegetation survey data: div_1m2Data and div_10m2Data_100m2Data.

Let’s have a peek at each.

head(plants$div_1m2Data)
##                                    uid         namedLocation domainID siteID
## 1 9220a6d0-9fc1-4a74-8b3d-cc55905d2c38 BART_006.basePlot.div      D01   BART
## 2 90f009ce-e411-4479-96e6-26e0db69eed0 BART_006.basePlot.div      D01   BART
## 3 1e19e64f-0b83-42f6-a5fb-64a0f24ebce0 BART_006.basePlot.div      D01   BART
## 4 0a131547-e5d4-4b14-bff0-1bb6b8751e08 BART_006.basePlot.div      D01   BART
## 5 c3552194-bed9-4435-8079-2a2c73639f40 BART_006.basePlot.div      D01   BART
## 6 9b861991-5f2a-4470-b534-1b9ab314b090 BART_006.basePlot.div      D01   BART
##   decimalLatitude decimalLongitude geodeticDatum coordinateUncertainty
## 1        44.06051        -71.31091         WGS84                 20.11
## 2        44.06051        -71.31091         WGS84                 20.11
## 3        44.06051        -71.31091         WGS84                 20.11
## 4        44.06051        -71.31091         WGS84                 20.11
## 5        44.06051        -71.31091         WGS84                 20.11
## 6        44.06051        -71.31091         WGS84                 20.11
##   elevation elevationUncertainty       nlcdClass   plotID subplotID    endDate
## 1     432.7                  0.2 deciduousForest BART_006    40.3.1 2014-06-10
## 2     432.7                  0.2 deciduousForest BART_006    40.3.1 2014-06-10
## 3     432.7                  0.2 deciduousForest BART_006    40.3.1 2014-06-10
## 4     432.7                  0.2 deciduousForest BART_006    40.3.1 2014-06-10
## 5     432.7                  0.2 deciduousForest BART_006    40.3.1 2014-06-10
## 6     432.7                  0.2 deciduousForest BART_006    31.1.1 2014-06-10
##   boutNumber samplingProtocolVersion    divDataType targetTaxaPresent
## 1          1                    <NA> otherVariables              <NA>
## 2          1                    <NA> otherVariables              <NA>
## 3          1                    <NA>   plantSpecies                 Y
## 4          1                    <NA> otherVariables              <NA>
## 5          1                    <NA> otherVariables              <NA>
## 6          1                    <NA> otherVariables              <NA>
##   otherVariablesPresent taxonID              scientificName taxonRank
## 1                     Y    <NA>                        <NA>      <NA>
## 2                     Y    <NA>                        <NA>      <NA>
## 3                  <NA>  VILA11 Viburnum lantanoides Michx.   species
## 4                     Y    <NA>                        <NA>      <NA>
## 5                     Y    <NA>                        <NA>      <NA>
## 6                     Y    <NA>                        <NA>      <NA>
##           family nativeStatusCode identificationQualifier taxonIDRemarks
## 1           <NA>             <NA>                    <NA>           <NA>
## 2           <NA>             <NA>                    <NA>           <NA>
## 3 Caprifoliaceae                N                    <NA>           <NA>
## 4           <NA>             <NA>                    <NA>           <NA>
## 5           <NA>             <NA>                    <NA>           <NA>
## 6           <NA>             <NA>                    <NA>           <NA>
##   morphospeciesID morphospeciesIDRemarks identificationReferences
## 1            <NA>                   <NA>                     <NA>
## 2            <NA>                   <NA>                     <NA>
## 3            <NA>                   <NA>                     <NA>
## 4            <NA>                   <NA>                     <NA>
## 5            <NA>                   <NA>                     <NA>
## 6            <NA>                   <NA>                     <NA>
##   otherVariables percentCover heightPlantOver300cm heightPlantSpecies remarks
## 1         lichen          0.5                 <NA>                 NA    <NA>
## 2           wood          1.0                 <NA>                 NA    <NA>
## 3           <NA>         18.0                    N                 23    <NA>
## 4           rock          1.0                 <NA>                 NA    <NA>
## 5      overstory        100.0                 <NA>                 NA    <NA>
## 6           rock          2.0                 <NA>                 NA    <NA>
##              measuredBy             recordedBy  publicationDate
## 1 dcrandall@neoninc.org  jsusser@field-ops.org 20200225T182334Z
## 2 dcrandall@neoninc.org  jsusser@field-ops.org 20200225T182334Z
## 3 dcrandall@neoninc.org  JSusser@field-ops.org 20200225T182334Z
## 4 dcrandall@neoninc.org  jsusser@field-ops.org 20200225T182334Z
## 5 dcrandall@neoninc.org  jsusser@field-ops.org 20200225T182334Z
## 6 dcrandall@neoninc.org bchagnon@field-ops.org 20200225T182334Z
head(plants$div_10m2Data100m2Data)
##                                    uid         namedLocation domainID siteID
## 1 d15e53cd-eac5-4451-acab-c1a76118ec5b BART_006.basePlot.div      D01   BART
## 2 36327acd-8580-49a0-b02a-d02e64725c47 BART_006.basePlot.div      D01   BART
## 3 0fa7cced-3dc7-4727-84b1-afa12dcf09b7 BART_006.basePlot.div      D01   BART
## 4 5049b515-597c-45b7-9f92-72fd3339bb0a BART_006.basePlot.div      D01   BART
## 5 0b8813dd-5adc-4ed9-a699-fac357270ec6 BART_006.basePlot.div      D01   BART
## 6 0a814211-f7c0-4526-9e0c-f0b3bbbb4e17 BART_006.basePlot.div      D01   BART
##   decimalLatitude decimalLongitude geodeticDatum coordinateUncertainty
## 1        44.06051        -71.31091         WGS84                 20.11
## 2        44.06051        -71.31091         WGS84                 20.11
## 3        44.06051        -71.31091         WGS84                 20.11
## 4        44.06051        -71.31091         WGS84                 20.11
## 5        44.06051        -71.31091         WGS84                 20.11
## 6        44.06051        -71.31091         WGS84                 20.11
##   elevation elevationUncertainty       nlcdClass   plotID subplotID    endDate
## 1     432.7                  0.2 deciduousForest BART_006        41 2014-06-10
## 2     432.7                  0.2 deciduousForest BART_006        41 2014-06-10
## 3     432.7                  0.2 deciduousForest BART_006        41 2014-06-10
## 4     432.7                  0.2 deciduousForest BART_006   40.3.10 2014-06-10
## 5     432.7                  0.2 deciduousForest BART_006   31.4.10 2014-06-10
## 6     432.7                  0.2 deciduousForest BART_006        32 2014-06-10
##   boutNumber samplingProtocolVersion targetTaxaPresent taxonID
## 1          1                    <NA>                 Y    TSCA
## 2          1                    <NA>                 Y   ABIES
## 3          1                    <NA>                 Y    PIST
## 4          1                    <NA>                 Y    ACPE
## 5          1                    <NA>                 Y   ACSA3
## 6          1                    <NA>                 Y    ACPE
##                   scientificName taxonRank    family nativeStatusCode
## 1 Tsuga canadensis (L.) Carrière   species  Pinaceae                N
## 2                      Abies sp.     genus  Pinaceae              UNK
## 3               Pinus strobus L.   species  Pinaceae                N
## 4          Acer pensylvanicum L.   species Aceraceae                N
## 5        Acer saccharum Marshall   species Aceraceae                N
## 6          Acer pensylvanicum L.   species Aceraceae                N
##   identificationQualifier taxonIDRemarks morphospeciesID morphospeciesIDRemarks
## 1                    <NA>           <NA>            <NA>                   <NA>
## 2                    <NA>           <NA>            <NA>                   <NA>
## 3                    <NA>           <NA>            <NA>                   <NA>
## 4                    <NA>           <NA>            <NA>                   <NA>
## 5                    <NA>           <NA>            <NA>                   <NA>
## 6                    <NA>           <NA>            <NA>                   <NA>
##   identificationReferences additionalSpecies remarks              measuredBy
## 1                     <NA>              <NA>    <NA> CBlodgett@field-ops.org
## 2                     <NA>              <NA>    <NA> CBlodgett@field-ops.org
## 3                     <NA>              <NA>    <NA> CBlodgett@field-ops.org
## 4                     <NA>              <NA>    <NA>   dcrandall@neoninc.org
## 5                     <NA>              <NA>    <NA>   dcrandall@neoninc.org
## 6                     <NA>              <NA>    <NA>   dcrandall@neoninc.org
##               recordedBy  publicationDate
## 1  dcrandall@neoninc.org 20200225T182334Z
## 2  dcrandall@neoninc.org 20200225T182334Z
## 3  dcrandall@neoninc.org 20200225T182334Z
## 4  JSusser@field-ops.org 20200225T182334Z
## 5 bchagnon@field-ops.org 20200225T182334Z
## 6  JSusser@field-ops.org 20200225T182334Z

How many rows are in each?

1m2 plots

nrow(plants$div_1m2Data)
## [1] 33599

10m2 / 100m2 plots

nrow(plants$div_10m2Data100m2Data)
## [1] 21771

Those are tens of thousands of rows!

Let’s have a look at how many unique species occur in each data set.

1m2 plots

length(unique(plants$div_1m2Data$scientificName))
## [1] 286

10m2 / 100m2 plots

length(unique(plants$div_10m2Data100m2Data$scientificName))
## [1] 410

Calculating data summaries

We are going to do some quick and dirty data summaries.

First, let’s calculate species richness values for plots. Here, we are going to work at the plot level, including all of the data from the subplots. To do this, we first need to make a new column that describes just the plot number.

Combine data sets

all_plants <- rbind(
  select(plants$div_10m2Data100m2Data, siteID, plotID, subplotID, endDate, scientificName, family, nativeStatusCode), 
  select(plants$div_1m2Data, siteID, plotID, subplotID, endDate, scientificName, family, nativeStatusCode))

all_plants <- filter(all_plants, !is.na(scientificName))
all_plants$mainsubplotID <- as.numeric(gsub(pattern = "\\..*", replacement = "", all_plants$subplotID))
all_plants_summary <-
  all_plants %>%
  group_by(siteID, plotID) %>%
  summarise(Richness = length(unique(scientificName)))
all_plants_summary
## # A tibble: 68 x 3
## # Groups:   siteID [2]
##    siteID plotID   Richness
##    <chr>  <chr>       <int>
##  1 BART   BART_001       39
##  2 BART   BART_002       33
##  3 BART   BART_003       43
##  4 BART   BART_004       27
##  5 BART   BART_005       34
##  6 BART   BART_006       28
##  7 BART   BART_007       25
##  8 BART   BART_008       20
##  9 BART   BART_010       27
## 10 BART   BART_011       25
## # … with 58 more rows
ggplot(data = all_plants_summary) +
  geom_histogram(aes(x = Richness, fill = siteID), position = "dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

t.test(all_plants_summary$Richness ~ all_plants_summary$siteID)
## 
##  Welch Two Sample t-test
## 
## data:  all_plants_summary$Richness by all_plants_summary$siteID
## t = -7.625, df = 38.692, p-value = 3.175e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -53.05435 -30.80365
## sample estimates:
## mean in group BART mean in group HARV 
##           30.82857           72.75758
length(unique(all_plants$plotID))
## [1] 68
length(unique(all_plants$mainsubplotID))
## [1] 4