The goal of the Data Report assignment is for you to demonstrate your ability to use R to examine a data set. This includes reading in a data set, calculating summary statistics for a data set, and creating data visualizations.
Please submit your Data Report as both an R Markdown (Rmd) file and a Word (docx) file. See the introduction to Assignment 1 if you need more information about the Rmd format.
I am expecting a Data Report that is written as a consistent narrative, rather than simply answers to questions posed below. Imagine you are submitting this report to a supervisor who has asked you to review the data set and deliver a few key points about it. You should write in complete sentences and paragraphs, with a logical order and flow. Your code should be embedded within the report as R chunks in the appropriate sections.
Here are the sections I am expecting in your report:
Your plots should have x- and y-axis labels that are meaningful AND figure captions that describe the figure with enough detail that a reader could understand the figure without reading the rest of the report.
For this Data Report, you will work with the NYS Spill Incidents Data Set. This is a data set of more than 500k records related to petroleum and other hazardous materials spills here in New York State. An overview of this data set is provided in a pdf document linked from the website above. Below is draft code to download this data set. Note that this is a fairly large data set (approx. 92 MB), and may take a few moments to download.
NOTE: this chunk is set as eval = FALSE. This is because you only need to download the data once. When this is complete and the data set is saved to your computer, you do not need to download the data during the knitting process.
download.file(url = "https://raw.githubusercontent.com/mlammens/ENS-623-Research-Stats/88869202dfbf499d573233a1b47b8bc6bbb0eecb/data/Spill_Incidents_20250303.csv",
destfile = "data/Spill_Incidents_SP25.csv")
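If you would rather not rely on remembering whether the file is already on your computer, one option is to check for it before downloading. The following is a minimal sketch, not part of the course materials: the helper name download_if_missing is hypothetical, and it assumes you want the destination folder created if it does not exist yet.

```r
# Hypothetical helper: download a file only if it is not already on disk.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    # The file is already saved locally, so skip the download.
    return(invisible(FALSE))
  }
  # Create the destination folder (e.g., "data") if it does not exist.
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  download.file(url = url, destfile = destfile)
  invisible(TRUE)
}
```

You would call it with the same url and destfile values used in the download.file call above.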
Here is the code you should use to load these data into your R environment. The path to this file may differ if you saved the data file somewhere other than in your data directory.
library(tidyverse)
spills <- read_csv("data/Spill_Incidents_SP25.csv")
Using these data and the skills and knowledge you’ve learned in class, address each of the following tasks.
Using the summary function, and any other functions you wish, look at these data and describe at least five observations / points about the data. NOTE - you may have to transform some of the data.frame columns to make them useful for data summaries (e.g., from characters to factors).
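To illustrate why the character-to-factor conversion matters, here is a small sketch on a toy data frame. The values are invented for illustration, and the Quantity column name is an assumption about the real data set; only Material Name is named in this assignment.

```r
library(dplyr)

# Toy stand-in for the spills data; all values are invented.
toy <- tibble(
  `Material Name` = c("#2 fuel oil", "diesel", "#2 fuel oil", "gasoline"),
  Quantity = c(50, 10, 200, 5)  # assumed column name
)

# On a character column, summary() reports only length, class, and mode.
summary(toy$`Material Name`)

# After converting to a factor, summary() reports a count for each level.
toy <- toy %>%
  mutate(`Material Name` = as.factor(`Material Name`))
material_summary <- summary(toy$`Material Name`)
material_summary
```

The same mutate pattern can be applied to any other character column that represents categories.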
Calculate the number of spills reported for each different toxin (i.e., Material Name). Make sure that the resulting table presents the data in descending order, from most common material spilled to least common.
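One possible approach, shown on invented toy data rather than the real data set, is to count rows per material and then sort the counts. The output column name n_spills is a choice made for this sketch.

```r
library(dplyr)

# Toy stand-in for the spills data; all values are invented.
toy <- tibble(
  `Material Name` = c("diesel", "#2 fuel oil", "gasoline",
                      "#2 fuel oil", "diesel", "#2 fuel oil")
)

# Count the rows for each material, then sort from most to least common.
material_counts <- toy %>%
  count(`Material Name`, name = "n_spills") %>%
  arrange(desc(n_spills))

material_counts
```

Passing sort = TRUE to count() is an equivalent shortcut for the arrange step.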
Calculate the probability that a reported spill is related to “#2 fuel oil”. Would you describe this as a likely event based on this result?
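As a sketch of the idea, an empirical probability can be estimated as the proportion of rows that meet a condition. The toy values below are invented, so the result says nothing about the real data.

```r
# Toy stand-in for the spills data; all values are invented.
toy <- data.frame(
  `Material Name` = c("#2 fuel oil", "diesel", "#2 fuel oil",
                      "gasoline", "motor oil"),
  check.names = FALSE  # keep the space in the column name
)

# The mean of a logical vector is the proportion of TRUE values,
# i.e., (number of matching rows) / (total number of rows).
p_fuel_oil <- mean(toy$`Material Name` == "#2 fuel oil")
p_fuel_oil
```

If the real column contains missing values, you will need to decide how to treat them (for example with na.rm = TRUE) and say so in your report.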
Make a histogram of the quantity of “#2 fuel oil” spilled, for reported spills related to “#2 fuel oil”. NOTE - you should filter your data so the quantity is less than 1000, because there are some extremely large outlier spills.
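The filter-then-plot pattern might look like the sketch below. It uses invented toy data, assumes the quantity column is named Quantity, and uses 30 bins as an arbitrary choice; check the column names and units in the real data set before labeling your axes.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the spills data; all values are invented.
toy <- tibble(
  `Material Name` = c("#2 fuel oil", "#2 fuel oil", "#2 fuel oil",
                      "diesel", "#2 fuel oil"),
  Quantity = c(20, 150, 5000, 30, 75)  # assumed column name
)

# Keep only #2 fuel oil spills with a quantity below 1000.
fuel_oil_small <- toy %>%
  filter(`Material Name` == "#2 fuel oil", Quantity < 1000)

# Histogram with meaningful axis labels.
fuel_oil_hist <- ggplot(fuel_oil_small, aes(x = Quantity)) +
  geom_histogram(bins = 30) +
  labs(x = "Quantity of #2 fuel oil spilled",
       y = "Number of reported spills")

fuel_oil_hist
```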
Calculate the probability that a reported spill is related to “#2 fuel oil” AND had a quantity less than or equal to 100 gallons. Is it more likely that a “#2 fuel oil” spill was greater than or less than 100 gallons? Explain your answer.
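Note that this task involves two different proportions: a joint one (out of all reported spills) and a conditional one (out of #2 fuel oil spills only). The sketch below shows both on invented toy data, again assuming a column named Quantity; it is one way to frame the calculation, not the required method.

```r
library(dplyr)

# Toy stand-in for the spills data; all values are invented.
toy <- tibble(
  `Material Name` = c("#2 fuel oil", "#2 fuel oil", "#2 fuel oil",
                      "diesel", "#2 fuel oil"),
  Quantity = c(20, 150, 5000, 30, 75)  # assumed column name
)

# Joint proportion: #2 fuel oil AND quantity <= 100, out of all spills.
p_joint <- mean(toy$`Material Name` == "#2 fuel oil" & toy$Quantity <= 100)

# Conditional proportion: quantity <= 100, out of #2 fuel oil spills only.
fuel_oil <- toy %>% filter(`Material Name` == "#2 fuel oil")
p_small_given_fuel_oil <- mean(fuel_oil$Quantity <= 100)

p_joint
p_small_given_fuel_oil
```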
Are there any observable trends through time in the number of #2 fuel oil spills reported per year? NOTE - there are many ways for you to answer this question, but you must justify your answer with your R code.
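As one of the many possible starting points, you could count spills per year and then plot or model those counts. The sketch below uses invented toy data and assumes a date column named Spill Date that has already been parsed as a Date; the real column name and format may differ.

```r
library(dplyr)

# Toy stand-in for the spills data; all values are invented.
toy <- tibble(
  `Material Name` = c("#2 fuel oil", "#2 fuel oil", "#2 fuel oil",
                      "diesel", "#2 fuel oil", "#2 fuel oil"),
  `Spill Date` = as.Date(c("2001-03-14", "2001-07-02", "2002-01-20",
                           "2002-11-05", "2002-12-30", "2002-06-18"))
)

# Extract the year from each date and count #2 fuel oil spills per year.
spills_per_year <- toy %>%
  filter(`Material Name` == "#2 fuel oil") %>%
  mutate(year = as.integer(format(`Spill Date`, "%Y"))) %>%
  count(year, name = "n_spills")

spills_per_year
```

A line or bar plot of n_spills against year is a natural next step for describing any trend.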
Material Name - e.g., diesel, motor oil, etc.