In this activity, you will be introduced to some basic data processing and exploration steps using data from the National Ecological Observatory Network (NEON). You can learn more about NEON at this website - https://www.neonscience.org/. This activity is designed so that you can complete it using a mobile device, such as your phone, in conjunction with RStudio Cloud, https://rstudio.cloud/. Throughout the first part of the activity, we will work with the NEON Plant Presence and Percent Cover data set. More details on these data can be found here - https://data.neonscience.org/data-products/DP1.10058.001.
By completing this activity, you should be able to:
data.frame
object in Rdplyr
code to summerise a data setIn order to carry out the analyses in this activity, you will first need to initate a new R session. If you are on a desktop computer with RStudio installed, start a new RStudio session. (NOTE: In this activity, I am not being careful in distiguishing R and RStudio. Strictly speaking, this is poor practice. R is the computer programming language and RStudio is an Integrated Development Environment that makes using R easier. You need to have both programs installed on your desktop.) If you do not have RStudio installed, or you are working on a mobile device, you can sign-up for a free RStudio Cloud account, where you can run an RStudio session through a browser. Go to https://rstudio.cloud/ to begin this process.
Next, let’s get the Rmd file that will allow us to run all of the code for this activity. This Rmd file is also the file that translates into this webpage! (Yes, that’s meta.) We will use the download.file
function to get this Rmd file. You will need to copy and paste this line in your RStudio session.
download.file("https://github.com/mlammens/NEONonMobile/raw/master/working-with-neon.Rmd",
destfile = "./temp.Rmd")
If you are working on a mobile device …
install.packages("dplyr")
install.packages("ggplot2")
install.packages("neonUtilities")
Load packages
library(neonUtilities)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Set global option to NOT convert all character variables to factors
options(stringsAsFactors=F)
We will begin by working with NEONs plant presence and percent cover data set. More details on these data can be found here - https://data.neonscience.org/data-products/DP1.10058.001.
loadByProduct
First, we will use the loadByProduct
function in neonUtilities
.
NOTE: Here we are using the check.size = FALSE
argument. We are doing this to prevent the function from sending a query to the console, and because we know these data are relatively small in size (approx. 2.5 MB). However, if working with potentially larger data sets, then you should set check.size = TRUE
.
plants <- loadByProduct(dpID = "DP1.10058.001", site = c("HARV", "BART"), check.size = FALSE)
## Downloading files totaling approximately 2.490959 MB
## Downloading 22 files
##
|
| | 0%
|
|=== | 5%
|
|====== | 9%
|
|========== | 14%
|
|============= | 18%
|
|================ | 23%
|
|=================== | 27%
|
|====================== | 32%
|
|========================= | 36%
|
|============================= | 41%
|
|================================ | 45%
|
|=================================== | 50%
|
|====================================== | 55%
|
|========================================= | 59%
|
|============================================= | 64%
|
|================================================ | 68%
|
|=================================================== | 73%
|
|====================================================== | 77%
|
|========================================================= | 82%
|
|============================================================ | 86%
|
|================================================================ | 91%
|
|=================================================================== | 95%
|
|======================================================================| 100%
##
## Unpacking zip files using 1 cores.
## Stacking operation across a single core.
## Stacking table div_10m2Data100m2Data
## Stacking table div_1m2Data
## Copied the most recent publication of validation file to /stackedFiles
## Copied the most recent publication of variable definition file to /stackedFiles
## Finished: Stacked 2 data tables and 2 metadata tables!
## Stacking took 1.294869 secs
Stack data
stackByTable("NEON_presence-cover-plant.zip")
Read data
plant_data_1m <- read.csv("NEON_presence-cover-plant/stackedFiles/div_1m2Data.csv")
Let’s examine the data we just recieved. They should be stored as a list
object. You can find out more information about lists here - https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html.
For our immediate perposes, what we need to know is that we can access any element of a list
with the $
operator. But first, let’s get the names of these different objects.
names(plants)
## [1] "div_10m2Data100m2Data" "div_1m2Data" "readme_10058"
## [4] "validation_10058" "variables_10058"
The readme_10058
element looks promising.
plants$readme_10058
## # A tibble: 2,063 x 1
## X1
## <chr>
## 1 ###################################
## 2 ########### Disclaimer ############
## 3 This is the most recent readme publication based on all site-date combinatio…
## 4 Information specific to the query, including sites and dates, has been remov…
## 5 All files used during stacking are listed at the bottom of this document, wh…
## 6 ##################################
## 7 This data package been produced by and downloaded from the National Ecologic…
## 8 DATA PRODUCT INFORMATION
## 9 ------------------------
## 10 ID: NEON.DOM.SITE.DP1.10058.001
## # … with 2,053 more rows
OK. If we navigated that readme just a bit, we would have learned that there are at least two data sets with the actual vegetation survey data: div_1m2Data
and div_10m2Data_100m2Data
.
Let’s have a peek at each.
head(plants$div_1m2Data)
## uid namedLocation domainID siteID
## 1 9220a6d0-9fc1-4a74-8b3d-cc55905d2c38 BART_006.basePlot.div D01 BART
## 2 90f009ce-e411-4479-96e6-26e0db69eed0 BART_006.basePlot.div D01 BART
## 3 1e19e64f-0b83-42f6-a5fb-64a0f24ebce0 BART_006.basePlot.div D01 BART
## 4 0a131547-e5d4-4b14-bff0-1bb6b8751e08 BART_006.basePlot.div D01 BART
## 5 c3552194-bed9-4435-8079-2a2c73639f40 BART_006.basePlot.div D01 BART
## 6 9b861991-5f2a-4470-b534-1b9ab314b090 BART_006.basePlot.div D01 BART
## decimalLatitude decimalLongitude geodeticDatum coordinateUncertainty
## 1 44.06051 -71.31091 WGS84 20.11
## 2 44.06051 -71.31091 WGS84 20.11
## 3 44.06051 -71.31091 WGS84 20.11
## 4 44.06051 -71.31091 WGS84 20.11
## 5 44.06051 -71.31091 WGS84 20.11
## 6 44.06051 -71.31091 WGS84 20.11
## elevation elevationUncertainty nlcdClass plotID subplotID endDate
## 1 432.7 0.2 deciduousForest BART_006 40.3.1 2014-06-10
## 2 432.7 0.2 deciduousForest BART_006 40.3.1 2014-06-10
## 3 432.7 0.2 deciduousForest BART_006 40.3.1 2014-06-10
## 4 432.7 0.2 deciduousForest BART_006 40.3.1 2014-06-10
## 5 432.7 0.2 deciduousForest BART_006 40.3.1 2014-06-10
## 6 432.7 0.2 deciduousForest BART_006 31.1.1 2014-06-10
## boutNumber samplingProtocolVersion divDataType targetTaxaPresent
## 1 1 <NA> otherVariables <NA>
## 2 1 <NA> otherVariables <NA>
## 3 1 <NA> plantSpecies Y
## 4 1 <NA> otherVariables <NA>
## 5 1 <NA> otherVariables <NA>
## 6 1 <NA> otherVariables <NA>
## otherVariablesPresent taxonID scientificName taxonRank
## 1 Y <NA> <NA> <NA>
## 2 Y <NA> <NA> <NA>
## 3 <NA> VILA11 Viburnum lantanoides Michx. species
## 4 Y <NA> <NA> <NA>
## 5 Y <NA> <NA> <NA>
## 6 Y <NA> <NA> <NA>
## family nativeStatusCode identificationQualifier taxonIDRemarks
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 Caprifoliaceae N <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## morphospeciesID morphospeciesIDRemarks identificationReferences
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 <NA> <NA> <NA>
## otherVariables percentCover heightPlantOver300cm heightPlantSpecies remarks
## 1 lichen 0.5 <NA> NA <NA>
## 2 wood 1.0 <NA> NA <NA>
## 3 <NA> 18.0 N 23 <NA>
## 4 rock 1.0 <NA> NA <NA>
## 5 overstory 100.0 <NA> NA <NA>
## 6 rock 2.0 <NA> NA <NA>
## measuredBy recordedBy publicationDate
## 1 dcrandall@neoninc.org jsusser@field-ops.org 20200225T182334Z
## 2 dcrandall@neoninc.org jsusser@field-ops.org 20200225T182334Z
## 3 dcrandall@neoninc.org JSusser@field-ops.org 20200225T182334Z
## 4 dcrandall@neoninc.org jsusser@field-ops.org 20200225T182334Z
## 5 dcrandall@neoninc.org jsusser@field-ops.org 20200225T182334Z
## 6 dcrandall@neoninc.org bchagnon@field-ops.org 20200225T182334Z
head(plants$div_10m2Data100m2Data)
## uid namedLocation domainID siteID
## 1 d15e53cd-eac5-4451-acab-c1a76118ec5b BART_006.basePlot.div D01 BART
## 2 36327acd-8580-49a0-b02a-d02e64725c47 BART_006.basePlot.div D01 BART
## 3 0fa7cced-3dc7-4727-84b1-afa12dcf09b7 BART_006.basePlot.div D01 BART
## 4 5049b515-597c-45b7-9f92-72fd3339bb0a BART_006.basePlot.div D01 BART
## 5 0b8813dd-5adc-4ed9-a699-fac357270ec6 BART_006.basePlot.div D01 BART
## 6 0a814211-f7c0-4526-9e0c-f0b3bbbb4e17 BART_006.basePlot.div D01 BART
## decimalLatitude decimalLongitude geodeticDatum coordinateUncertainty
## 1 44.06051 -71.31091 WGS84 20.11
## 2 44.06051 -71.31091 WGS84 20.11
## 3 44.06051 -71.31091 WGS84 20.11
## 4 44.06051 -71.31091 WGS84 20.11
## 5 44.06051 -71.31091 WGS84 20.11
## 6 44.06051 -71.31091 WGS84 20.11
## elevation elevationUncertainty nlcdClass plotID subplotID endDate
## 1 432.7 0.2 deciduousForest BART_006 41 2014-06-10
## 2 432.7 0.2 deciduousForest BART_006 41 2014-06-10
## 3 432.7 0.2 deciduousForest BART_006 41 2014-06-10
## 4 432.7 0.2 deciduousForest BART_006 40.3.10 2014-06-10
## 5 432.7 0.2 deciduousForest BART_006 31.4.10 2014-06-10
## 6 432.7 0.2 deciduousForest BART_006 32 2014-06-10
## boutNumber samplingProtocolVersion targetTaxaPresent taxonID
## 1 1 <NA> Y TSCA
## 2 1 <NA> Y ABIES
## 3 1 <NA> Y PIST
## 4 1 <NA> Y ACPE
## 5 1 <NA> Y ACSA3
## 6 1 <NA> Y ACPE
## scientificName taxonRank family nativeStatusCode
## 1 Tsuga canadensis (L.) Carrière species Pinaceae N
## 2 Abies sp. genus Pinaceae UNK
## 3 Pinus strobus L. species Pinaceae N
## 4 Acer pensylvanicum L. species Aceraceae N
## 5 Acer saccharum Marshall species Aceraceae N
## 6 Acer pensylvanicum L. species Aceraceae N
## identificationQualifier taxonIDRemarks morphospeciesID morphospeciesIDRemarks
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## identificationReferences additionalSpecies remarks measuredBy
## 1 <NA> <NA> <NA> CBlodgett@field-ops.org
## 2 <NA> <NA> <NA> CBlodgett@field-ops.org
## 3 <NA> <NA> <NA> CBlodgett@field-ops.org
## 4 <NA> <NA> <NA> dcrandall@neoninc.org
## 5 <NA> <NA> <NA> dcrandall@neoninc.org
## 6 <NA> <NA> <NA> dcrandall@neoninc.org
## recordedBy publicationDate
## 1 dcrandall@neoninc.org 20200225T182334Z
## 2 dcrandall@neoninc.org 20200225T182334Z
## 3 dcrandall@neoninc.org 20200225T182334Z
## 4 JSusser@field-ops.org 20200225T182334Z
## 5 bchagnon@field-ops.org 20200225T182334Z
## 6 JSusser@field-ops.org 20200225T182334Z
How many rows are in each?
1m2 plots
nrow(plants$div_1m2Data)
## [1] 33599
10m2 / 100m2 plots
nrow(plants$div_10m2Data100m2Data)
## [1] 21771
Those are tens of thousands of rows!
Let’s have a look at how many unique species occur in each data set.
1m2 plots
length(unique(plants$div_1m2Data$scientificName))
## [1] 286
10m2 / 100m2 plots
length(unique(plants$div_10m2Data100m2Data$scientificName))
## [1] 410
We are going to do some quick and dirty data summaries.
First, let’s calculate species richness values for plots. Here, we are going to work at the plot level, including all of the data from the subplots. To do this, we first need to make a new column that describes just the plot number.
Combine data sets
all_plants <- rbind(
select(plants$div_10m2Data100m2Data, siteID, plotID, subplotID, endDate, scientificName, family, nativeStatusCode),
select(plants$div_1m2Data, siteID, plotID, subplotID, endDate, scientificName, family, nativeStatusCode))
all_plants <- filter(all_plants, !is.na(scientificName))
all_plants$mainsubplotID <- as.numeric(gsub(pattern = "\\..*", replacement = "", all_plants$subplotID))
all_plants_summary <-
all_plants %>%
group_by(siteID, plotID) %>%
summarise(Richness = length(unique(scientificName)))
all_plants_summary
## # A tibble: 68 x 3
## # Groups: siteID [2]
## siteID plotID Richness
## <chr> <chr> <int>
## 1 BART BART_001 39
## 2 BART BART_002 33
## 3 BART BART_003 43
## 4 BART BART_004 27
## 5 BART BART_005 34
## 6 BART BART_006 28
## 7 BART BART_007 25
## 8 BART BART_008 20
## 9 BART BART_010 27
## 10 BART BART_011 25
## # … with 58 more rows
ggplot(data = all_plants_summary) +
geom_histogram(aes(x = Richness, fill = siteID), position = "dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
t.test(all_plants_summary$Richness ~ all_plants_summary$siteID)
##
## Welch Two Sample t-test
##
## data: all_plants_summary$Richness by all_plants_summary$siteID
## t = -7.625, df = 38.692, p-value = 3.175e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -53.05435 -30.80365
## sample estimates:
## mean in group BART mean in group HARV
## 30.82857 72.75758
length(unique(all_plants$plotID))
## [1] 68
length(unique(all_plants$mainsubplotID))
## [1] 4