Due: Sunday, 01-Dec. at 8pm

Rules: This entire assignment is SOLO. You may not work with other classmates, though you may consult instructors for help.

Overview: For this assignment, you will need to find a dataset of your choosing and create a short report that contains a description of the dataset as well as a few key insights about the data, including at least two summary visualizations. You will publish your report online at http://rpubs.com/. Before beginning the assignment, be sure to have reviewed the lessons on Data Frames, Data Wrangling, Data Visualization and Communicating Information.


Mini data analysis project

1) Choose a dataset [5 pts]

You will need to select a dataset for your project. To keep things manageable, you must choose one of the following datasets from the following libraries. Note that to load any of these data frames, all you need to do is load the library:

dplyr:

  • storms
  • starwars

ggplot2:

  • diamonds
  • economics
  • midwest
  • mpg
  • msleep
  • txhousing

dslabs:

  • brexit_polls
  • gapminder
  • greenhouse_gases
  • historic_co2
  • movielens
  • murders
  • nyc_regents_scores
  • olive
  • polls_us_election_2016
  • reported_heights
  • research_funding_rates
  • results_us_election_2016
  • stars
  • temp_carbon
  • trump_tweets
  • us_contagious_diseases

2) Setup your analysis environment [10 pts]

Once you’ve chosen a dataset, start setting up your analysis environment by following these steps:

  1. Open RStudio and create a new project called “hw6-lastName”, replacing “lastName” with your last name.
  2. Download the hw6.Rmd template script and place it in RStudio project folder you just created.
  3. In the “yaml header” at the top of this file, fill out your name, and give your project a title. Your title must include the dataset name and the package from which the dataset is sourced. For example, if you are using the admissions dataset in the dslabs package, an example title might be: “Summary of the admissions dataset from the dslabs package”
  4. Create a .R file and save it in your project folder as explore.R - we’ll use this file to explore your dataset later.

3) Inspect your data [10 pts]

Now that your environment is set up, open up your explore.R file and begin exploring your dataset. Be sure to load the library that contains the dataset.

Write some code to preview and summarize the dataset using some of the methods we’ve seen in class and in the lessons on data frames and data wrangling. You should be able to quickly get an understanding of what variables are included in the data frame and their nature. Consider the following questions in your exploration:

  • What type of data is each variable?
  • What is the total size of the data frame?
  • What are the boundaries of each period of observation (e.g. what time period do the observations in the data frame span)?
  • Do any variables have missing values, and if so which ones have more NAs than others? Why might that be the case?

Do not brush this step off - the more thoroughly you inspect your dataset, the easier (and better) you data exploration will be. This will be important for step 4, and absolutely critical for step 5. Make sure you take the time to develop an understanding of the variables in your dataset as it is nearly impossible to imagine what summary plots might be worth creating otherwise.

4) Document and summarize your dataset [15 pts]

Open up your “hw6.Rmd” file and write a paragraph in the section labeled # Data description describing of your dataset. Include the following:

  1. A brief overview of the subject matter of the dataset. What is this dataset about?
  2. A description of the data source. Read the package help files to identify as much information as possible about how the data were collected (when, where, by whom, etc.) and whether the data have already been pre-processed or not. If the data come from a published paper, include a citation to that paper.
  3. A summary table of each variable in the dataset. Each row should be a variable, and the columns should be the variable name and a short description of the variable (the “hw6.Rmd” template already has this table started for you).

5) Make plots [30 pts (15 per plot)]

Now that you have a basic understanding of the dataset, make some plots to explore the variables in the data and their potential relationships. You may use base R plotting functions or the ggplot2 library to make your figures, but you must make at least two figures, including:

  1. A scatterplot of involving at least two variables.
  2. A histogram or barplot involving at least one variable.

You can choose to plot whichever variables you wish, but you must be able to interpret the results of your plot. I recommend that you first make your plots in your explore.R file to iterate on your code until the plots are in their final form you desire. Then copy the code for your final plots over to the appropriate code chunks in your “hw6.Rmd” file.

6) Interpret your plots [15 pts]

Below each figure, write a description and interpretation of your plot. Make sure you address at least the following questions:

  1. Describe what variables you are plotting and why.
  2. Describe the primary relationship / trend / information you hope the reader will gain from your visualization.

7) Publish your writeup [10 pts]

Once you have completed your analysis, compile the “hw6.Rmd” file by opening it in RStudio and clicking the “knit” button. In the upper-right of the window that opens showing your compiled hw6.html file, click on the “Publish” button and publish your file to http://rpubs.com/. You should create an account with rpubs if you have not already. Once published, you can update it by making edits to your “hw6.Rmd” file, clicking the “knit” button again, and then clicking the “Republish” button in the upper-right of the RStudio preview window.

For a visual overview of the publishing process, refer to the instructions in the youtube video in the lesson on Communicating Information.

8) Submit your files on Blackboard [5 pts]

After publishing your “hw6.Rmd” file, create a zip file of all files in your R project folder for this assignment and submit the zip file on Blackboard by the due deadline. Include a link to your published report on rpubs in your Blackboard submission


EMSE 6574, Sec. 11: Programming for Analytics (Fall 2019)
George Washington University | School of Engineering & Applied Science
Dr. John Paul Helveston | jph@gwu.edu | Mondays | 6:10–8:40 PM | Phillips Hall 108 | |
This work is licensed under a Creative Commons Attribution 4.0 International License.
See the licensing page for more details about copyright information.
Content 2019 John Paul Helveston