Data Frames

.leftcol30[
<center>
<img src="https://github.com/emse-p4a-gwu/emse-p4a-gwu.github.io/raw/master/images/p4a_hex_sticker.png" width=250>
</center>
]

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> EMSE 4571: Intro to Programming for Analytics
### <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"/></svg> John Paul Helveston
### <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> March 24, 2022

---

background-color: #fff
class: center, middle

---

# Revised late policy for HW 9-12

<br>

### - Submissions by **6am** on due date: _full credit_
### - Submissions by **6am** on following **Monday** (3 days late): _50% credit_
### - Later submissions: _not graded_ (i.e. a 0)

---

# [AMG Grading](https://p4a.seas.gwu.edu/2022-Spring/syllabus.html#72_AMG_Grading)

---

# Before we start

Make sure you have these packages installed and loaded:

```r
install.packages("stringr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("readr")
install.packages("here")
```

(At the top of the `notes_blank.R` file)

Remember: you only need to install them once!
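
The block above only installs the packages. Loading is a separate step that has to be repeated in every new R session. A minimal sketch of what the top of a script might look like, assuming all five packages are already installed:

```r
# Installing happens once; loading happens in every new R session
library(stringr)
library(dplyr)
library(ggplot2)
library(readr)
library(here)
```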

---

## "The purpose of computing is insight, not numbers"
### - [Richard Hamming](https://en.wikipedia.org/wiki/Richard_Hamming)


---

# Week 9: .fancy[Data Frames]

### 1. Basics
### 2. Slicing

### BREAK

### 3. External data

---

# Week 9: .fancy[Data Frames]

### 1. .orange[Basics]
### 2. Slicing

### BREAK

### 3. External data

---

# The data frame...in Excel

---

# The data frame...in R

```r
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

beatles
```

```
#> # A tibble: 4 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 John      Lennon    guitar            1940 TRUE    
#> 2 Paul      McCartney bass              1942 FALSE   
#> 3 Ringo     Starr     drums             1940 FALSE   
#> 4 George    Harrison  guitar            1943 TRUE
```

---

# The data frame...in RStudio

```r
View(beatles) # Capital "V": opens the data viewer in RStudio
```


---

## **Columns**: _Vectors_ of values (must be same data type)

```r
beatles
```

Extract a column using `$`

```r
beatles$firstName
```

```
#> [1] "John"   "Paul"   "Ringo"  "George"
```

---

## **Rows**: Information about individual observations

Information about _John Lennon_ is in the first row:

```r
beatles[1,]
```

```
#> # A tibble: 1 × 5
#>   firstName lastName instrument yearOfBirth deceased
#>   <chr>     <chr>    <chr>            <dbl> <lgl>   
#> 1 John      Lennon   guitar            1940 TRUE
```

Information about _Paul McCartney_ is in the second row:

```r
beatles[2,]
```

```
#> # A tibble: 1 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 Paul      McCartney bass              1942 FALSE
```

---

## Make a data frame with `data.frame()`

```r
beatles <- data.frame(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)
```

```r
beatles
```

```
#>   firstName  lastName instrument yearOfBirth deceased
#> 1      John    Lennon     guitar        1940     TRUE
#> 2      Paul McCartney       bass        1942    FALSE
#> 3     Ringo     Starr      drums        1940    FALSE
#> 4    George  Harrison     guitar        1943     TRUE
```

---

## Make a data frame with `tibble()`

```r
library(dplyr)
```
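
The `tibble()` call takes the same `name = vector` arguments as `data.frame()`. A sketch that reuses the columns from the `data.frame()` slide:

```r
library(dplyr) # makes tibble() available

# Same columns as the data.frame() version, built with tibble() instead
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)
```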

```r
beatles
```

---

## Why I use `tibble()` instead of `data.frame()`

1. A tibble prints its **dimensions** and each column's **data type**.

2. A tibble prints only the first few rows when you enter the object name.<br>Example: `faithful` vs. `as_tibble(faithful)`

3. Columns of class `character` are _never_ converted into factors (don't worry about this for now...just know that tibbles make life easier when dealing with character type columns).

**Note**: I use the term **"data frame"** to refer to both `tibble()` and `data.frame()` objects.
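
To try the `faithful` comparison yourself, here is a small sketch. `faithful` ships with base R; the name `faithful_tbl` is just an illustrative choice:

```r
library(dplyr)

class(faithful)      # a plain data.frame; printing it dumps every row

# Convert to a tibble
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)  # "tbl_df" "tbl" "data.frame"
dim(faithful_tbl)    # 272 rows, 2 columns

faithful_tbl         # prints the dimensions, column types, and first 10 rows
```

Because a tibble still has class `"data.frame"`, the functions on the next slides work on either kind of object.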

---

## Data frame vectors must have the same length

```r
beatles <- tibble(
*   firstName   = c("John", "Paul", "Ringo", "George", "Bob"), # Added "Bob"
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)
```

```
#> Error:
#> ! Tibble columns must have compatible sizes.
#> • Size 5: Existing data.
#> • Size 4: Column `lastName`.
#> ℹ Only values of size one are recycled.
```

---

## Use `NA` for missing values

```r
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George", "Bob"), 
*   lastName    = c("Lennon", "McCartney", "Starr", "Harrison", NA), # Added NAs
*   instrument  = c("guitar", "bass", "drums", "guitar", NA),
*   yearOfBirth = c(1940, 1942, 1940, 1943, NA),
*   deceased    = c(TRUE, FALSE, FALSE, TRUE, NA)
)
```

```r
beatles
```

```
#> # A tibble: 5 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 John      Lennon    guitar            1940 TRUE    
#> 2 Paul      McCartney bass              1942 FALSE   
#> 3 Ringo     Starr     drums             1940 FALSE   
#> 4 George    Harrison  guitar            1943 TRUE    
#> 5 Bob       <NA>      <NA>                NA NA
```
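
Once a column contains `NA`s, it helps to know where they are. A minimal sketch using base R's `is.na()` on a two-column version of the data frame above (trimmed here only to keep the example short):

```r
library(dplyr)

beatles <- tibble(
    firstName = c("John", "Paul", "Ringo", "George", "Bob"),
    lastName  = c("Lennon", "McCartney", "Starr", "Harrison", NA)
)

# TRUE wherever the value is missing
missing_last <- is.na(beatles$lastName)
missing_last

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing_last)
n_missing
```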

---

# Dimensions: `nrow()`, `ncol()`, & `dim()`

```r
nrow(beatles) # Number of rows
```

```
#> [1] 5
```

```r
ncol(beatles) # Number of columns
```

```
#> [1] 5
```

```r
dim(beatles)  # Number of rows and columns
```

```
#> [1] 5 5
```

---

### .center[Use `names()` or `colnames()` to see the available variables]

Get the names of columns:

```r
names(beatles)
```

```
#> [1] "firstName"   "lastName"    "instrument"  "yearOfBirth" "deceased"
```

```r
colnames(beatles)
```

```
#> [1] "firstName"   "lastName"    "instrument"  "yearOfBirth" "deceased"
```

Get the names of rows (rarely needed):

```r
rownames(beatles)
```

```
#> [1] "1" "2" "3" "4" "5"
```

---

# Changing the column names

Change the column names with `names()` or `colnames()`:

```r
names(beatles) <- c('one', 'two', 'three', 'four', 'five')
beatles
```

```
#> # A tibble: 5 × 5
#>   one    two       three   four five 
#>   <chr>  <chr>     <chr>  <dbl> <lgl>
#> 1 John   Lennon    guitar  1940 TRUE 
#> 2 Paul   McCartney bass    1942 FALSE
#> 3 Ringo  Starr     drums   1940 FALSE
#> 4 George Harrison  guitar  1943 TRUE 
#> 5 Bob    <NA>      <NA>      NA NA
```

---

# Changing the column names

Make all the column names upper-case:

```r
colnames(beatles) <- stringr::str_to_upper(colnames(beatles))
beatles
```

```
#> # A tibble: 5 × 5
#>   FIRSTNAME LASTNAME  INSTRUMENT YEAROFBIRTH DECEASED
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 John      Lennon    guitar            1940 TRUE    
#> 2 Paul      McCartney bass              1942 FALSE   
#> 3 Ringo     Starr     drums             1940 FALSE   
#> 4 George    Harrison  guitar            1943 TRUE    
#> 5 Bob       <NA>      <NA>                NA NA
```

---

## Combine data frames by columns using `bind_cols()`

Note: `bind_cols()` is from the **dplyr** package

```r
names <- tibble(
    firstName = c("John", "Paul", "Ringo", "George"),
    lastName  = c("Lennon", "McCartney", "Starr", "Harrison"))

instruments <- tibble(
    instrument = c("guitar", "bass", "drums", "guitar"))
```

```r
bind_cols(names, instruments)
```

```
#> # A tibble: 4 × 3
#>   firstName lastName  instrument
#>   <chr>     <chr>     <chr>     
#> 1 John      Lennon    guitar    
#> 2 Paul      McCartney bass      
#> 3 Ringo     Starr     drums     
#> 4 George    Harrison  guitar
```

---

## Combine data frames by rows using `bind_rows()`

Note: `bind_rows()` is from the **dplyr** package

```r
members1 <- tibble(
    firstName = c("John", "Paul"),
    lastName  = c("Lennon", "McCartney"))

members2 <- tibble(
    firstName = c("Ringo", "George"),
    lastName  = c("Starr", "Harrison"))
```

```r
bind_rows(members1, members2)
```

```
#> # A tibble: 4 × 2
#>   firstName lastName 
#>   <chr>     <chr>    
#> 1 John      Lennon   
#> 2 Paul      McCartney
#> 3 Ringo     Starr    
#> 4 George    Harrison
```

---

## Note: `bind_rows()` requires the **same** column names:

```r
*colnames(members2) <- c("firstName", "LastName")
bind_rows(members1, members2)
```

```
#> # A tibble: 4 × 3
#>   firstName lastName  LastName
#>   <chr>     <chr>     <chr>   
#> 1 John      Lennon    <NA>    
#> 2 Paul      McCartney <NA>    
#> 3 Ringo     <NA>      Starr   
#> 4 George    <NA>      Harrison
```

Note how `<NA>`s were created
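
One way to avoid the extra column is to make the names match before binding. A sketch, assuming the two data frames really do hold the same variables in the same order (the name `members` is illustrative):

```r
library(dplyr)

members1 <- tibble(
    firstName = c("John", "Paul"),
    lastName  = c("Lennon", "McCartney"))

members2 <- tibble(
    firstName = c("Ringo", "George"),
    LastName  = c("Starr", "Harrison")) # Note the capital "L"

# Copy the column names from members1 so the two sets agree
colnames(members2) <- colnames(members1)

members <- bind_rows(members1, members2)
members
```

This only makes sense when the columns line up by position; if they don't, rename each column individually instead.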

---

## Your turn

Answer these questions using the `animals_farm` and `animals_pet` data frames:

1. Write code to find how many _rows_ are in the `animals_farm` data frame.
2. Write code to find how many _columns_ are in the `animals_pet` data frame.
3. Create a new data frame, `animals`, by combining `animals_farm` and `animals_pet`.
4. Change the column names of `animals` to title case.
5. Add a new column to `animals` called `type` that tells if an animal is a `"farm"` or `"pet"` animal.

---

# Week 9: .fancy[Data Frames]

### 1. Basics
### 2. .orange[Slicing]

### BREAK

### 3. External data

---

## Access data frame columns using the `$` symbol

```r
beatles$firstName
```

```
#> [1] "John"   "Paul"   "Ringo"  "George"
```

```r
beatles$lastName
```

```
#> [1] "Lennon"    "McCartney" "Starr"     "Harrison"
```

---

# Creating new variables with the `$` symbol

Add the hometown of the band members:

```r
beatles$hometown <- 'Liverpool'
beatles
```

```
#> # A tibble: 4 × 6
#>   firstName lastName  instrument yearOfBirth deceased hometown 
#>   <chr>     <chr>     <chr>            <dbl> <lgl>    <chr>    
#> 1 John      Lennon    guitar            1940 TRUE     Liverpool
#> 2 Paul      McCartney bass              1942 FALSE    Liverpool
#> 3 Ringo     Starr     drums             1940 FALSE    Liverpool
#> 4 George    Harrison  guitar            1943 TRUE     Liverpool
```

---

# Creating new variables with the `$` symbol

Add a new `alive` variable:

```r
beatles$alive <- c(FALSE, TRUE, TRUE, FALSE)
beatles
```

```
#> # A tibble: 4 × 7
#>   firstName lastName  instrument yearOfBirth deceased hometown  alive
#>   <chr>     <chr>     <chr>            <dbl> <lgl>    <chr>     <lgl>
#> 1 John      Lennon    guitar            1940 TRUE     Liverpool FALSE
#> 2 Paul      McCartney bass              1942 FALSE    Liverpool TRUE 
#> 3 Ringo     Starr     drums             1940 FALSE    Liverpool TRUE 
#> 4 George    Harrison  guitar            1943 TRUE     Liverpool FALSE
```

---

## You can compute new variables from current ones

Compute and add the age of the band members:

```r
beatles$age <- 2020 - beatles$yearOfBirth
beatles
```

```
#> # A tibble: 4 × 8
#>   firstName lastName  instrument yearOfBirth deceased hometown  alive   age
#>   <chr>     <chr>     <chr>            <dbl> <lgl>    <chr>     <lgl> <dbl>
#> 1 John      Lennon    guitar            1940 TRUE     Liverpool FALSE    80
#> 2 Paul      McCartney bass              1942 FALSE    Liverpool TRUE     78
#> 3 Ringo     Starr     drums             1940 FALSE    Liverpool TRUE     80
#> 4 George    Harrison  guitar            1943 TRUE     Liverpool FALSE    77
```

---

## Access elements by index: `DF[row, column]`

General form for indexing elements:

```r
DF[row, column]
```

Select the element in row 1, column 2:

```r
beatles[1, 2]
```

```
#> # A tibble: 1 × 1
#>   lastName
#>   <chr>   
#> 1 Lennon
```


Select the elements in rows 1 & 2 and columns 2 & 3:

```r
beatles[c(1, 2), c(2, 3)]
```

```
#> # A tibble: 2 × 2
#>   lastName  instrument
#>   <chr>     <chr>     
#> 1 Lennon    guitar    
#> 2 McCartney bass
```
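
Notice that `beatles[1, 2]` returned a 1 × 1 tibble, not a bare value. If you want the value itself, one option is to pull out the column with `$` and then index the resulting vector. A sketch on a trimmed copy of the data (the names `cell` and `value` are illustrative):

```r
library(dplyr)

beatles <- tibble(
    firstName = c("John", "Paul", "Ringo", "George"),
    lastName  = c("Lennon", "McCartney", "Starr", "Harrison")
)

cell  <- beatles[1, 2]        # still a tibble (1 row, 1 column)
value <- beatles$lastName[1]  # a character vector of length 1

cell
value
```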


---

## Leave row or column "blank" to select all

```r
beatles[c(1, 2),] # Selects all COLUMNS for rows 1 & 2
```

```
#> # A tibble: 2 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 John      Lennon    guitar            1940 TRUE    
#> 2 Paul      McCartney bass              1942 FALSE
```

```r
beatles[,c(1, 2)] # Selects all ROWS for columns 1 & 2
```

```
#> # A tibble: 4 × 2
#>   firstName lastName 
#>   <chr>     <chr>    
#> 1 John      Lennon   
#> 2 Paul      McCartney
#> 3 Ringo     Starr    
#> 4 George    Harrison
```

---

## Negative indices exclude row / column

```r
beatles[-1, ] # Select all ROWS except the first
```

```
#> # A tibble: 3 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 Paul      McCartney bass              1942 FALSE   
#> 2 Ringo     Starr     drums             1940 FALSE   
#> 3 George    Harrison  guitar            1943 TRUE
```

```r
beatles[,-1] # Select all COLUMNS except the first
```

```
#> # A tibble: 4 × 4
#>   lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>            <dbl> <lgl>   
#> 1 Lennon    guitar            1940 TRUE    
#> 2 McCartney bass              1942 FALSE   
#> 3 Starr     drums             1940 FALSE   
#> 4 Harrison  guitar            1943 TRUE
```

---

# You can select columns by their names

Note: you don't need the comma to select an entire column

One column

```r
beatles['firstName']
```

```
#> # A tibble: 4 × 1
#>   firstName
#>   <chr>    
#> 1 John     
#> 2 Paul     
#> 3 Ringo    
#> 4 George
```


<br>Multiple columns

```r
beatles[c('firstName', 'lastName')]
```

```
#> # A tibble: 4 × 2
#>   firstName lastName 
#>   <chr>     <chr>    
#> 1 John      Lennon   
#> 2 Paul      McCartney
#> 3 Ringo     Starr    
#> 4 George    Harrison
```


---

## Use logical indices to _filter_ rows

**Which Beatles members are still alive?**<br>Create a logical vector using the `deceased` column:

```r
beatles$deceased == FALSE
```

```
#> [1] FALSE  TRUE  TRUE FALSE
```

Insert this logical vector in the ROW position of `beatles[,]`:

```r
beatles[beatles$deceased == FALSE,]
```

```
#> # A tibble: 2 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 Paul      McCartney bass              1942 FALSE   
#> 2 Ringo     Starr     drums             1940 FALSE
```
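
Logical vectors can also be combined with `&` (and) or `|` (or) before they go in the row position. A sketch with a made-up question: which living members were born after 1940? The names `alive` and `result` are illustrative:

```r
library(dplyr)

beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

# !x flips TRUE / FALSE, so this is the same as beatles$deceased == FALSE
alive <- !beatles$deceased

# Keep rows where BOTH conditions are TRUE
result <- beatles[alive & beatles$yearOfBirth > 1940, ]
result
```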

---

## Your turn

Answer these questions using the `beatles` data frame:

1. Create a new column, `playsGuitar`, which is `TRUE` if the band member plays the guitar and `FALSE` otherwise.
2. Filter the data frame to select only the rows for the band members who have four-letter first names.
3. Create a new column, `fullName`, which contains the band member's first and last name separated by a space (e.g. `"John Lennon"`)

---

# .fancy[Break]

---

# Week 9: .fancy[Data Frames]

### 1. Basics
### 2. Slicing

### BREAK

### 3. .orange[External data]

---

# Getting data into R

<br>

## Options:

## 1. Load external packages
## 2. Read in external files (usually a `.csv`* file)

<br>

*csv = "comma-separated values"

---

## Data from an R package

```r
library(ggplot2)
```

See which data frames are available in a package:

```r
data(package = "ggplot2")
```

---

# Find out about package data sets with `?`

```r
?msleep
```

```
msleep {ggplot2}

An updated and expanded version of the mammals sleep dataset

Description

This is an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.
```

---

# Previewing data frames: `msleep`

Look at the data in a "spreadsheet"-like way:

```r
View(msleep) # Capital "V": opens the data viewer in RStudio
```

This is "read-only" so you can't corrupt the data 😄

---

# My favorite quick summary: `glimpse()`

Preview each variable with `str()` or `glimpse()`

```r
glimpse(msleep)
```

```
#> Rows: 83
#> Columns: 11
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-tailed shrew", "Cow", "Three-toed sloth", "Northern fur seal", "Vesper mouse", "Dog", "Roe deer", "Goat", "Guinea pig", "Grivet", "Chinchilla", "Star-nosed mole", "Afri…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bradypus", "Callorhinus", "Calomys", "Canis", "Capreolus", "Capri", "Cavis", "Cercopithecus", "Chinchilla", "Condylura", "Cricetomys", "Cryptotis", "Dasypus", "Dendrohyrax",…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carni", NA, "carni", "herbi", "herbi", "herbi", "omni", "herbi", "omni", "omni", "omni", "carni", "herbi", "omni", "herbi", "insecti", "herbi", "herbi", "omni", "omni", "herb…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Artiodactyla", "Pilosa", "Carnivora", "Rodentia", "Carnivora", "Artiodactyla", "Artiodactyla", "Rodentia", "Primates", "Rodentia", "Soricomorpha", "Rodentia", "Soricomorpha"…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "domesticated", "lc", "lc", "domesticated", "lc", "domesticated", "lc", NA, "lc", "lc", "lc", "lc", "en", "lc", "domesticated", "domesticated", "lc", "lc", NA, "domesticated",…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5.3, 9.4, 10.0, 12.5, 10.3, 8.3, 9.1, 17.4, 5.3, 18.0, 3.9, 19.7, 2.9, 3.1, 10.1, 10.9, 14.9, 12.5, 9.8, 1.9, 2.7, 6.2, 6.3, 8.0, 9.5, 3.3, 19.4, 10.1, 14.2, 14.3, 12.8, 1…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, 0.7, 1.5, 2.2, 2.0, 1.4, 3.1, 0.5, 4.9, NA, 3.9, 0.6, 0.4, 3.5, 1.1, NA, 3.2, 1.1, 0.4, 0.1, 1.5, 0.6, 1.9, 0.9, NA, 6.6, 1.2, 1.9, 3.1, NA, 1.4, 2.0, NA, NA, 0.9, NA, 0.…
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, NA, 0.3333333, NA, NA, 0.2166667, NA, 0.1166667, NA, NA, 0.1500000, 0.3833333, NA, 0.3333333, NA, 0.1166667, 1.0000000, NA, 0.2833333, NA, NA, 0.4166667, 0.5500000, NA, NA…
#> $ awake        <dbl> 11.90, 7.00, 9.60, 9.10, 20.00, 9.60, 15.30, 17.00, 13.90, 21.00, 18.70, 14.60, 14.00, 11.50, 13.70, 15.70, 14.90, 6.60, 18.70, 6.00, 20.10, 4.30, 21.10, 20.90, 13.90, 13.10, 9.10, 11.50, 14.20, 22.10, 21.35, 17.80, 17.70, 16.0…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.09820, 0.11500, 0.00550, NA, 0.00640, 0.00100, 0.00660, 0.00014, 0.01080, 0.01230, 0.00630, 4.60300, 0.00030, 0.65500, 0.41900, 0.00350, 0.11500, NA, 0.02560, 0.00500, N…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 14.000, 14.800, 33.500, 0.728, 4.750, 0.420, 0.060, 1.000, 0.005, 3.500, 2.950, 1.700, 2547.000, 0.023, 521.000, 187.000, 0.770, 10.000, 0.071, 3.300, 0.200, 899.995, …
```


---

## Also very useful for quick checks: `head()` and `tail()`

View the **first** 6 rows with `head()`

```r
head(msleep)
```

```
#> # A tibble: 6 × 11
#>   name                       genus      vore  order        conservation sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
#>   <chr>                      <chr>      <chr> <chr>        <chr>              <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
#> 1 Cheetah                    Acinonyx   carni Carnivora    lc                  12.1      NA        NA      11.9 NA        50    
#> 2 Owl monkey                 Aotus      omni  Primates     <NA>                17         1.8      NA       7    0.0155    0.48 
#> 3 Mountain beaver            Aplodontia herbi Rodentia     nt                  14.4       2.4      NA       9.6 NA         1.35 
#> 4 Greater short-tailed shrew Blarina    omni  Soricomorpha lc                  14.9       2.3       0.133   9.1  0.00029   0.019
#> 5 Cow                        Bos        herbi Artiodactyla domesticated         4         0.7       0.667  20    0.423   600    
#> 6 Three-toed sloth           Bradypus   herbi Pilosa       <NA>                14.4       2.2       0.767   9.6 NA         3.85
```


View the **last** 6 rows with `tail()`

```r
tail(msleep)
```

```
#> # A tibble: 6 × 11
#>   name                 genus    vore  order        conservation sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
#>   <chr>                <chr>    <chr> <chr>        <chr>              <dbl>     <dbl>       <dbl> <dbl>   <dbl>   <dbl>
#> 1 Tenrec               Tenrec   omni  Afrosoricida <NA>                15.6       2.3      NA       8.4  0.0026   0.9  
#> 2 Tree shrew           Tupaia   omni  Scandentia   <NA>                 8.9       2.6       0.233  15.1  0.0025   0.104
#> 3 Bottle-nosed dolphin Tursiops carni Cetacea      <NA>                 5.2      NA        NA      18.8 NA      173.   
#> 4 Genet                Genetta  carni Carnivora    <NA>                 6.3       1.3      NA      17.7  0.0175   2    
#> 5 Arctic fox           Vulpes   carni Carnivora    <NA>                12.5      NA        NA      11.5  0.0445   3.38 
#> 6 Red fox              Vulpes   carni Carnivora    <NA>                 9.8       2.4       0.35   14.2  0.0504   4.23
```


---

# Importing an external data file

<br>

- **DO NOT** double-click it!
- **DO NOT** open it in Excel!

Excel can **corrupt** your data!

If you must open it in Excel:

- Make a copy
- Open the copy

---

# Steps to importing external data files

## 1. Create a path to the data

```r
library(here)
*pathToData <- here('data', 'data.csv')
pathToData
```

```
#> [1] "/Users/jhelvy/gh/0gw/P4A/2022-Spring/class/9-data-frames/data/data.csv"
```

## 2. Import the data

```r
library(readr)
*df <- read_csv(pathToData)
```

---

## Using the **here** package to make file paths

With no arguments, `here()` returns the path to the **root** of your project <br>(this is where your `.Rproj` file lives!)

```r
here()
```

```
#> [1] "/Users/jhelvy/gh/0gw/P4A/2022-Spring/class/9-data-frames"
```

With arguments, `here()` builds the path to files _inside_ your project root

```r
path_to_data <- here('data', 'data.csv')
path_to_data
```

```
#> [1] "/Users/jhelvy/gh/0gw/P4A/2022-Spring/class/9-data-frames/data/data.csv"
```
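
Before importing, it can be worth checking that the path actually points at a file. A small sketch using base R's `file.exists()`; whether it returns `TRUE` depends on your own project folder:

```r
library(here)

# Each folder / file name is a separate argument; here() adds the separators
path_to_data <- here('data', 'data.csv')

# TRUE if the file is where you think it is, FALSE otherwise
file.exists(path_to_data)
```

If this returns `FALSE`, check that you opened the project by its `.Rproj` file and that the `data` folder sits next to it.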

---

# Avoid hard-coding file paths!

### (they can break on different computers)

```r
path_to_data <- 'data/data.csv'
path_to_data
```

```
#> [1] "data/data.csv"
```

# 💩💩💩

---

## Use the **here** package to make file paths


<center><br>
<img src="images/horst_monsters_here.png">
</center>Art by [Allison Horst](https://www.allisonhorst.com/)


---

# Back to reading in data

```r
path_to_data <- here('data', 'data.csv')
*data <- read_csv(path_to_data)
```
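
To see what `read_csv()` gives back without needing the course file, here is a self-contained sketch that writes a tiny made-up CSV to a temporary file and reads it in again. The columns `x` and `y` are invented for illustration:

```r
library(readr)

# Write a small example file to a temporary location
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c(1, 2, 3), y = c("a", "b", "c")), tmp)

# Read it back in: the result is a tibble
df <- read_csv(tmp)
class(df)
dim(df)
```

`read_csv()` also reports the column types it guessed, which is a quick sanity check that the file was parsed the way you expected.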

<br>

**Important**: Use `read_csv()` instead of `read.csv()`

---

## Your turn

1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `df`.

2) Use the `df` object to answer the following questions:

- How many rows and columns are in the data frame?
- What type of data is each column?
- Preview the different columns - what do you think this data is about? What might one row represent?
- How many unique airports are in the data frame?
- What is the earliest and latest observation in the data frame?
- What is the lowest and highest cost of any one repair in the data frame?


---

## Next week: better data wrangling with **dplyr**

<center>
<img src="images/horst_monsters_data_wrangling.png" width="600">
</center>Art by [Allison Horst](https://www.allisonhorst.com/)

---

# Select rows with `filter()`

Example: Filter rows to find which Beatles members are still alive.

**Base R**:

```r
beatles[beatles$deceased == FALSE,]
```

**dplyr**:

```r
filter(beatles, deceased == FALSE)
```

---

# In 2 weeks: plotting with **ggplot2**

## Translate _data_...

```
#> # A tibble: 11 × 2
#>    brainwt   bodywt
#>      <dbl>    <dbl>
#>  1 0.001      0.06 
#>  2 0.0066     1    
#>  3 0.00014    0.005
#>  4 0.0108     3.5  
#>  5 0.0123     2.95 
#>  6 0.0063     1.7  
#>  7 4.60    2547    
#>  8 0.0003     0.023
#>  9 0.655    521    
#> 10 0.419    187    
#> 11 0.0035     0.77
```


## ...into _information_


---

# A note about HW 9

- You have what you need to start now.
- It will be _much_ easier if you use the **dplyr** functions (i.e. read ahead).

---

## Extra Practice!

1. Install the **dslabs** package. 
2. Load the package, then use `data(package = "dslabs")` to see the different data sets in this package. 
3. Pick one. 
4. Answer these questions:

- What is the dataset about?
- How many observations are in the data frame?
- What is the original source of the data?
- What type of data is each variable?
- Find one thing interesting about it to share.