Due: Sunday, 10-Nov. at 8pm
Rules:
Instructions:
Before beginning this assignment, be sure to have read the Data Frames and Data Wrangling lessons.
If you haven’t installed the dplyr library yet, do so now:
install.packages("dplyr")
library(dplyr)
Open RStudio and create a new project called “hw5-lastName”, replacing “lastName” with your last name.
Download the hw5.R template script and place it in the RStudio project folder you just created.
Create a new folder called “data” inside the same project folder (you’ll put data in this folder later).
Fill out your name, GW Net ID, and the names of anyone you worked with in the header of the “hw5.R” file.
Type all of your answers to the questions below in the “hw5.R” script.
After completing the questions, create a zip file of all files in your R project folder for this assignment.
Submit the zip file on Blackboard by the deadline.
For this assignment, we will work with data on flights from New York City airports during 2013. To load the data, install and load the nycflights13 package.
Look at the datasets that are included in this package:
data(package = "nycflights13")
Data sets in package 'nycflights13':
airlines Airline names.
airports Airport metadata
flights Flights data
planes Plane metadata.
weather Hourly weather data
Write some code to preview and summarize each of these data frames using some of the methods we’ve seen in class and in the lessons on data frames and data wrangling. You should be able to quickly get an understanding of what variables are included in each data frame and their nature. For each dataset, consider the following questions in your exploration:
NA values than others? Why might that be the case?
Use the data frames in the nycflights13 library to answer the following questions. For each question, write R code to find the solution. Leave comments where appropriate to explain what you are doing, and then write your final answer as a comment at the end.
For example, if the question was “how many observations are in the flights data frame?”, here is an acceptable answer:
# Find the number of rows in the flights data frame
nrow(flights)
## [1] 336776
# Answer: There are 336,776 observations in the flights data frame
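In the same commented style, the earlier preview-and-summarize step might look like the sketch below. It uses only base R on a small made-up data frame (`toy` and its values are hypothetical, not taken from nycflights13), so treat it as one possible starting point rather than the expected approach.

```r
# Hypothetical toy data frame standing in for one of the package's tables
toy <- data.frame(
  carrier   = c("AA", "UA", "DL", "AA"),
  dep_delay = c(5, NA, -2, 30),
  arr_delay = c(NA, NA, -10, 25)
)

# Dimensions and the first few rows
dim(toy)
head(toy, 3)

# Column types and a per-column summary
str(toy)
summary(toy)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

Comparing the per-column NA counts is a quick way to spot variables that are missing more often than others.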
You do not have to use the dplyr library functions (e.g. filter(), arrange(), mutate()) to answer these questions, but doing so is strongly encouraged.
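As a reminder of how these verbs behave, here is a minimal sketch on a made-up data frame. The name `toy_trips` and its columns are hypothetical and chosen only for illustration; the real data frames have different columns.

```r
library(dplyr)

# Hypothetical toy data; column names are made up for this illustration
toy_trips <- data.frame(
  airline = c("AA", "UA", "DL", "B6", "UA"),
  month   = c(1, 9, 12, 7, 10),
  delay   = c(15, 130, -5, 240, 45)
)

# Keep rows that satisfy two conditions at once (comma means logical AND)
late_in_year <- toy_trips %>%
  filter(month >= 9, delay > 30)

# Keep rows whose value matches any member of a set
selected <- toy_trips %>%
  filter(airline %in% c("AA", "DL"))

# Sort from the largest delay to the smallest
sorted <- toy_trips %>%
  arrange(desc(delay))

nrow(late_in_year)
nrow(selected)
head(sorted, 1)
```

Wrapping the result in nrow() is one simple way to turn a filtered data frame into a count.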
How many flights out of NYC airports in 2013 had an arrival delay of two or more hours? Hint: use filter()
How many flights out of NYC airports in 2013 departed during the fall semester (i.e. the months August through December, inclusive)? Hint: use filter()
How many flights out of NYC airports in 2013 arrived more than two hours late to their destinations, but did not depart an NYC airport late? Hint: use filter()
How many flights out of NYC airports in 2013 were operated by United, American, or Delta airlines? Hint: use filter()
List the top 3 airlines (by name, not carrier code) that had the highest delay time of any one flight leaving an NYC airport in 2013. Hint: use arrange()
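Because the flights data frame stores carrier codes while the airline names live in the airlines data frame, one possible approach (an assumption here, since the hint only mentions arrange()) is to join two tables on a shared key. The sketch below shows the idea on made-up tables; `toy_trips`, `toy_names`, and the key column `code` are all hypothetical.

```r
library(dplyr)

# Hypothetical toy tables that share a key column called "code"
toy_trips <- data.frame(
  code  = c("X1", "Y2", "X1", "Z3"),
  delay = c(10, 300, 45, 120)
)
toy_names <- data.frame(
  code = c("X1", "Y2", "Z3"),
  name = c("Xenon Air", "Yonder Airways", "Zephyr Jet")
)

# Attach the readable name to each trip, then sort by delay (largest first)
named_trips <- toy_trips %>%
  left_join(toy_names, by = "code") %>%
  arrange(desc(delay))

head(named_trips, 2)
```

left_join() keeps every row of the left table, so no trips are dropped even if a code has no matching name.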
How many flights out of NYC airports in 2013 flew to the 3 major DC-area airports: Reagan National, Dulles, or BWI? Hint: use filter()
What are the year of manufacture and tail number of the oldest airplane that any one airline used in 2013 to fly out of NYC airports, and which airline operated that plane? Hint: use arrange() and filter()
Using the flights data frame, compute a new variable speed (in miles per hour) using the air_time and distance variables. For the fastest flight in the dataset, what was its speed and what were the origin and destination airport codes? Hint: use mutate() and arrange()
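The general pattern of deriving a rate from a distance and a duration in minutes can be sketched on made-up data as follows. The data frame `toy_runs` and its columns `miles` and `minutes` are hypothetical; the point is only the unit conversion inside mutate().

```r
library(dplyr)

# Hypothetical toy data: distances in miles, durations in minutes
toy_runs <- data.frame(
  id      = c("a", "b", "c"),
  miles   = c(100, 250, 60),
  minutes = c(60, 120, 30)
)

# Convert minutes to hours before dividing, then keep the fastest row
fastest <- toy_runs %>%
  mutate(mph = miles / (minutes / 60)) %>%
  arrange(desc(mph)) %>%
  slice(1)

fastest
```

Dividing by `minutes` directly would give miles per minute, so the division by 60 is what makes the result miles per hour.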
Using the flights data frame, compute a new variable delta_time (in minutes) that is equal to the amount of time that was either lost or made up during the flight. “Lost” time is less than 0 and reflects a flight time that was longer than scheduled, while “made up” time is greater than 0 and reflects a flight time that was faster than scheduled. For the flight that made up the most time during its flight, how much time was made up (in minutes) and what were the origin and destination airport codes? Hint: use mutate() and arrange()
Of all the flights in 2013 departing from NYC airports, list the top 3 destinations (airport names, not airport codes) with the highest mean arrival delay. Hint: Use a “pipeline” of group_by(), summarise(), and arrange(). Don’t forget to filter out any NA values before summarizing!
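The shape of such a pipeline, including the NA filter, can be sketched on made-up data. The data frame `toy_scores` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy_scores <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(10, NA, 4, 6, 20)
)

# Drop missing values, compute one mean per group, sort largest first
group_means <- toy_scores %>%
  filter(!is.na(value)) %>%
  group_by(group) %>%
  summarise(mean_value = mean(value)) %>%
  arrange(desc(mean_value))

group_means
```

Without the filter() step, the mean for group "a" would be NA, because mean() returns NA when any input is missing. Passing `na.rm = TRUE` to mean() is an alternative with the same effect here.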
Use the flights data frame to create a new summary data frame called dailyDelaySummary that contains the following variables for each day in 2013:
meanDepDelay: the mean departure delay (in minutes)
numDelayedFlights: the total number of delayed flights
Save this data frame in your “data” folder as “dailyDelaySummary.csv”. Hint: Use a “pipeline” of group_by() and summarise(). Don’t forget to filter out any NA values before summarizing!
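Computing several summaries per group and saving the result as a CSV file might look like the sketch below. The data frame `toy_log` is made up, the rule "delay greater than 0 counts as delayed" is an assumption for illustration, and the file is written to a temporary directory rather than to the project's “data” folder.

```r
library(dplyr)

# Hypothetical toy data: two days of delays (in minutes)
toy_log <- data.frame(
  day   = c(1, 1, 2, 2, 2),
  delay = c(12, -3, 0, 25, 40)
)

# One row per day, with a mean and a count of rows meeting a condition
daily <- toy_log %>%
  group_by(day) %>%
  summarise(
    mean_delay = mean(delay),
    num_late   = sum(delay > 0)  # assumption: "late" means delay > 0
  )

# Write the summary to a CSV file (a temporary path for this sketch)
out_path <- file.path(tempdir(), "toy_daily.csv")
write.csv(daily, out_path, row.names = FALSE)
```

sum() over a logical vector counts its TRUE values, which is why `sum(delay > 0)` yields a count. Setting `row.names = FALSE` keeps write.csv() from adding an extra column of row numbers.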
Using the dailyDelaySummary data frame that you created in part k), answer the following two questions:
How many flights have a missing departure time? What might these rows represent?
Which flights (i.e. carrier + flight) departed every day of the year, and which airports did they fly to?
Use the flights data frame to find which season has the highest mean flight departure delay. The seasons are defined as follows:
Which season experiences the largest mean delay, and why might that be? Hint: Use a “pipeline” of mutate(), group_by(), and summarise(). Don’t forget to filter out any NA values before summarizing! Also, you may need to use the if_else() function (see the Tips section in the lesson on data wrangling).
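As a rough illustration of labelling rows with if_else() before grouping, here is a sketch on made-up data. The data frame `toy_months` and the two-way split into halves of the year are hypothetical and are not the season definitions for this question.

```r
library(dplyr)

# Hypothetical toy data: one value per month
toy_months <- data.frame(
  month = c(1, 4, 7, 10, 12),
  value = c(3, 8, 5, 9, 1)
)

# Label each row with if_else(), then summarise by that label
half_means <- toy_months %>%
  mutate(half = if_else(month <= 6, "first half", "second half")) %>%
  group_by(half) %>%
  summarise(mean_value = mean(value))

half_means
```

With more than two categories, if_else() calls can be nested, or dplyr's case_when() can be used instead.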