Homework 11 - Programming with Data

Due: Apr 19 by 11:59pm

Submission Instructions: Create a zip file of all the files in your R project folder for this assignment, then submit your zip file on the corresponding assignment submission on Blackboard.

Weight: This assignment is worth 4% of your final grade.

Purpose: The purposes of this assignment are:

Be able to compose functions using the ‘tidy evaluation’ syntax to work with data.

Be able to parse through lists using purrr::map() functions.

Assessment: Each question indicates the % of the assignment grade, summing to 100%. The credit for each question will be assigned as follows:

0% for not attempting a response.

50% for attempting the question but with major errors.

75% for attempting the question but with minor errors.

100% for correctly answering the question.

The reflection portion is always worth 10% and graded for completion.

Rules:

Problems marked SOLO may not be worked on with other classmates, though you may consult instructors for help.

For problems marked COLLABORATIVE, you may work in groups of up to 3 students who are in this course this semester. You may not split up the work – everyone must work on every problem. And you may not simply copy any code but rather truly work together and submit your own solutions.

Readings

The readings from the last week will serve as a helpful reference as you complete this assignment. You can review them here:

Sections 26.3 (Data Frame Functions) & 26.4 (Plot Functions) in Hadley Wickham’s book R4DS: https://r4ds.hadley.nz/functions.html#data-frame-functions

This blog post on iteration with the {purrr} package: https://www.rebeccabarter.com/blog/2019-08-19_purrr

Using AI tools

On this assignment, you are encouraged to use ChatGPT and other AI tools (e.g. Github Copilot). But don’t just blindly copy-paste code. The code provided by these tools is not perfect, and you will likely need to modify it to get the correct solution. If you do use an AI tools, you must include the prompt(s) you used (using a comment with #) to recieve full credit. If you had to change anything to your prompt to get better results, write that down too in your code with a comment. Learn to use tools like ChatGPT as a learning assistant - a tool to help you accomplish the task - rather than just a solutions manual. One version of using it makes you a better and more efficient coder, the other robs you of that.

1) Staying organized [SOLO, 5%]

Download and use this template for your assignment. Inside the “hw11” folder, open and edit the R script called hw11.R and fill out your name, GW netID, and the names of anyone you worked with on this assignment.

Data Frame Functions

For questions 2 - 5, after writing your function, demonstrate it using a data frame of your choice from the dslabs package. For example, for question 2, you could use var_summary(dslabs::movielens, rating) (so, obviously, you should use a different example).

2) `var_summary(df, var)` [SOLO, 10%]

Write the function var_summary(df, var) that takes a data frame (df) and a variable (var) as inputs, and returns the minimum, maximum, mean, and median value of that variable. The function should remove any NA values in var when computing these summary statistics. The object returned should be a single data frame / tibble (not a vector).

3) `group_summary(df, var, group_var)` [SOLO, 10%]

Write the function group_summary(df, var, group_var) that takes a data frame (df), a variable (var), and a grouping variable (group_var) as inputs, and returns a summary table showing the count, mean, and standard deviation of the variable var grouped by group_var. The function should remove any NA values in var when computing these summary statistics. The object returned should be a single data frame / tibble (not a vector).

4) `var_hist(df, var, bins)` [SOLO, 10%]

Write the function var_hist(df, var, bins) that takes a data frame (df), a variable (var), and the number of bins (bins) as inputs, and returns a histogram of that variable with a user-specified number of bins as a ggplot object. The default number of bins should be 30.

5) `scatterplot(df, x, y)` [SOLO, 10%]

Write the function scatterplot(df, x, y) that takes a data frame (df) and two variables (x and y) as inputs, and returns a scatter plot of those two variables as a ggplot object.

Iteration across lists with `purrr`

Problems using `word_list`

For these questions, we will work with the sentences vector that comes loaded with the stringr package (which is loaded when you load the tidyverse package):

library(tidyverse)

head(stringr::sentences)

#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

This vector contains lots of random sentences. When we break those sentences into individual words using str_split(), we will get a list back where each item in the list is a vector of words:

word_list <- str_split(stringr::sentences, " ")

word_list[1:3]

#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

We will use this word_list for questions 6 - 8.

6) [COLLABORATIVE, 5%]

Using map(), write R code to obtain a vector of how many words are in each item in word_list.

7) [COLLABORATIVE, 5%]

Using map(), write R code to obtain a vector of the total number of characters in each item in word_list.

8) [COLLABORATIVE, 5%]

Using map(), write R code to obtain a vector of the number of times the word "the" appears in each item in word_list. Your result should ignore casing, so both "the" and "The" should count.

Problems using `sw_people`

As we saw in class, the sw_people list contains a list of information about each character in Star Wars. You can load the list from the repurrrsive package:

library(repurrrsive)

We will use the sw_people and sw_films lists for questions 9 & 10.

9) [COLLABORATIVE, 10%]

Using map() and the sw_films list, write R code to obtain a vector of integers that contains the number of characters in each Star Wars film.

10) [COLLABORATIVE, 10%]

Using map_df(), create a data frame where each row represents a character from sw_people. The columns should contain the following:

name: The character’s name, as a character.
height: The character’s height, as a number
is_male: Whether the character’s gender is "male" (TRUE or FALSE)
film_count: The number of films they have appeared in, as an integer.

11) [SOLO, 10%]

For the last problem, write your own homework question that requires the student (you) to use map() in the solution. You can use any lists of data you want for your question (e.g. sw_people, sw_films, got_chars, etc.). Then provide the answer to your question. As with all the other questions, if you use an AI tool to help you create and / or solve your question, include the prompt you used and comment on any changes you had to make to improve your outcome.

12) Read and reflect [SOLO, 10%]

Read and reflect on the following readings to preview what we will be covering next:

Chapter 25 (Web scraping) in Hadley Wickham’s book R4DS: https://r4ds.hadley.nz/webscraping.html

This post on accessing and collecting data with APIs in R: https://statisticsglobe.com/api-in-r

Afterwards, in a comment (#) in your .R file, write a short reflection on what you’ve learned and any questions or points of confusion you have about what we’ve covered thus far. This can just few a few sentences related to this assignment, next week’s readings, things going on in the world that remind you something from class, etc. If there’s anything that jumped out at you, write it down.

Submit

Instructions for how to submit your assignment are at the top of this page.