class: middle, inverse .leftcol30[ <center> <img src="https://github.com/emse-p4a-gwu/emse-p4a-gwu.github.io/raw/master/images/p4a_hex_sticker.png" width=250> </center> ] .rightcol70[ # Week 10: .fancy[Data Frames] ###
EMSE 4571: Intro to Programming for Analytics ###
John Paul Helveston ###
March 30, 2023 ] --- # Before we start Make sure you have these packages installed and loaded: ```r install.packages("stringr") install.packages("dplyr") install.packages("ggplot2") install.packages("readr") install.packages("here") ``` (At the top of the `practice.R` file) Remember: you only need to install them once! --- .leftcol[ ## "The purpose of computing is _insight_, not numbers" ### - [Richard Hamming](https://en.wikipedia.org/wiki/Richard_Hamming) ] .rightcol[ <img src="images/Richard_Hamming.jpg" width="400"> ] --- class: inverse, middle # Week 10: .fancy[Data Frames] ### 1. Basics ### 2. Slicing ### BREAK ### 3. External data --- class: inverse, middle # Week 10: .fancy[Data Frames] ### 1. .orange[Basics] ### 2. Slicing ### BREAK ### 3. External data --- class: center # The data frame...in Excel <center> <img src="images/spreadsheet.png" width=700> </center> --- # The data frame...in R ```r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- # The data frame...in RStudio .leftcol[ ```r View(beatles) ``` ] <img src="images/dataframe.png" width="700"> --- ## **Columns**: _Vectors_ of values (must be same data type) ```r beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` -- Extract a column using `$` ```r beatles$firstName ``` ``` #> [1] "John" "Paul" "Ringo" 
"George" ``` --- ## **Rows**: Information about individual observations -- Information about _John Lennon_ is in the first row: ```r beatles[1,] ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE ``` -- Information about _Paul McCartney_ is in the second row: ```r beatles[2,] ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` --- ## Make a data frame with `data.frame()` ```r beatles <- data.frame( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) ``` -- ```r beatles ``` ``` #> firstName lastName instrument yearOfBirth deceased #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- ## Make a data frame with `tibble()` ```r library(dplyr) ``` ```r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) ``` -- ```r beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- ## Why I use `tibble()` instead of `data.frame()` -- 1. The `tibble()` shows the **dimensions** and **data type**. -- 2. A tibble will only print the first few rows of data when you enter the object name Example: `faithful` vs. `as_tibble(faithful)` -- 3. 
Columns of class `character` are _never_ converted into factors (don't worry about this for now...just know that tibbles make life easier with strings). **Note**: I use the word **"data frame"** to refer to both `tibble()` and `data.frame()` objects --- ## Data frame vectors must have the same length ```r beatles <- tibble( * firstName = c("John", "Paul", "Ringo", "George", "Bob"), # Added "Bob" lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) ``` ``` #> Error: #> ! Tibble columns must have compatible sizes. #> • Size 5: Existing data. #> • Size 4: Column `lastName`. #> ℹ Only values of size one are recycled. ``` --- ## Use `NA` for missing values ```r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George", "Bob"), * lastName = c("Lennon", "McCartney", "Starr", "Harrison", NA), # Added NAs * instrument = c("guitar", "bass", "drums", "guitar", NA), * yearOfBirth = c(1940, 1942, 1940, 1943, NA), * deceased = c(TRUE, FALSE, FALSE, TRUE, NA) ) ``` -- ```r beatles ``` ``` #> # A tibble: 5 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE #> 5 Bob <NA> <NA> NA NA ``` --- # Dimensions: `nrow()`, `ncol()`, & `dim()` ```r nrow(beatles) # Number of rows ``` ``` #> [1] 5 ``` ```r ncol(beatles) # Number of columns ``` ``` #> [1] 5 ``` ```r dim(beatles) # Number of rows and columns ``` ``` #> [1] 5 5 ``` --- ### .center[Use `names()` or `colnames()` to see the available variables] Get the names of columns: ```r names(beatles) ``` ``` #> [1] "firstName" "lastName" "instrument" "yearOfBirth" "deceased" ``` ```r colnames(beatles) ``` ``` #> [1] "firstName" "lastName" "instrument" "yearOfBirth" "deceased" ``` -- Get the names of rows (rarely 
needed): ```r rownames(beatles) ``` ``` #> [1] "1" "2" "3" "4" "5" ``` --- # Changing the column names Change the column names with `names()` or `colnames()`: ```r names(beatles) <- c('one', 'two', 'three', 'four', 'five') beatles ``` ``` #> # A tibble: 5 × 5 #> one two three four five #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE #> 5 Bob <NA> <NA> NA NA ``` --- # Changing the column names Make all the column names uppercase: ```r colnames(beatles) <- stringr::str_to_upper(colnames(beatles)) beatles ``` ``` #> # A tibble: 5 × 5 #> FIRSTNAME LASTNAME INSTRUMENT YEAROFBIRTH DECEASED #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE #> 5 Bob <NA> <NA> NA NA ``` --- ## Combine data frames by columns using `bind_cols()` Note: `bind_cols()` is from the **dplyr** package ```r names <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison")) instruments <- tibble( instrument = c("guitar", "bass", "drums", "guitar")) ``` -- ```r bind_cols(names, instruments) ``` ``` #> # A tibble: 4 × 3 #> firstName lastName instrument #> <chr> <chr> <chr> #> 1 John Lennon guitar #> 2 Paul McCartney bass #> 3 Ringo Starr drums #> 4 George Harrison guitar ``` --- ## Combine data frames by rows using `bind_rows()` Note: `bind_rows()` is from the **dplyr** package ```r members1 <- tibble( firstName = c("John", "Paul"), lastName = c("Lennon", "McCartney")) members2 <- tibble( firstName = c("Ringo", "George"), lastName = c("Starr", "Harrison")) ``` -- ```r bind_rows(members1, members2) ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` --- ## Note: `bind_rows()` requires the **same** column 
names: ```r *colnames(members2) <- c("firstName", "LastName") bind_rows(members1, members2) ``` ``` #> # A tibble: 4 × 3 #> firstName lastName LastName #> <chr> <chr> <chr> #> 1 John Lennon <NA> #> 2 Paul McCartney <NA> #> 3 Ringo <NA> Starr #> 4 George <NA> Harrison ``` Note how `<NA>`s were created --- class: inverse
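## Matching column names before `bind_rows()`

One way to avoid the stray `<NA>`s shown above is to make the column names agree before binding. This is a minimal sketch, not the only fix: it reuses the `members1` and `members2` tibbles from the slides, and the result name `members` is made up for illustration. It assumes both tibbles hold the same columns in the same order.

```r
library(dplyr)  # provides tibble() and bind_rows()

members1 <- tibble(
    firstName = c("John", "Paul"),
    lastName  = c("Lennon", "McCartney"))
members2 <- tibble(
    firstName = c("Ringo", "George"),
    LastName  = c("Starr", "Harrison"))  # note the capital "L"

# Check whether the column names match exactly (case matters)
identical(names(members1), names(members2))

# Copy the names from members1 (assumes the same column order), then bind
names(members2) <- names(members1)
members <- bind_rows(members1, members2)
members
```

If the columns could be in a different order, copying names over like this would mislabel them, so inspect both with `names()` first.

---
class: inverse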
## Your turn Answer these questions using the `animals_farm` and `animals_pet` data frames: 1. Write code to find how many _rows_ are in the `animals_farm` data frame. 2. Write code to find how many _columns_ are in the `animals_pet` data frame. 3. Create a new data frame, `animals`, by combining `animals_farm` and `animals_pet`. 4. Change the column names of `animals` to title case. 5. Add a new column to `animals` called `type` that tells if an animal is a `"farm"` or `"pet"` animal. --- class: inverse, middle # Week 10: .fancy[Data Frames] ### 1. Basics ### 2. .orange[Slicing] ### BREAK ### 3. External data --- ## Access data frame columns using the `$` symbol ```r beatles$firstName ``` ``` #> [1] "John" "Paul" "Ringo" "George" ``` -- ```r beatles$lastName ``` ``` #> [1] "Lennon" "McCartney" "Starr" "Harrison" ``` --- # Creating new variables with the `$` symbol -- Add the hometown of the band members: ```r beatles$hometown <- 'Liverpool' beatles ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased hometown #> <chr> <chr> <chr> <dbl> <lgl> <chr> #> 1 John Lennon guitar 1940 TRUE Liverpool #> 2 Paul McCartney bass 1942 FALSE Liverpool #> 3 Ringo Starr drums 1940 FALSE Liverpool #> 4 George Harrison guitar 1943 TRUE Liverpool ``` --- # Creating new variables with the `$` symbol Add a new `alive` variable: ```r beatles$alive <- c(FALSE, TRUE, TRUE, FALSE) beatles ``` ``` #> # A tibble: 4 × 7 #> firstName lastName instrument yearOfBirth deceased hometown alive #> <chr> <chr> <chr> <dbl> <lgl> <chr> <lgl> #> 1 John Lennon guitar 1940 TRUE Liverpool FALSE #> 2 Paul McCartney bass 1942 FALSE Liverpool TRUE #> 3 Ringo Starr drums 1940 FALSE Liverpool TRUE #> 4 George Harrison guitar 1943 TRUE Liverpool FALSE ``` --- ## You can compute new variables from current ones -- Compute and add the age of the band members: ```r beatles$age <- 2023 - beatles$yearOfBirth beatles ``` ``` #> # A tibble: 4 × 8 #> firstName lastName instrument yearOfBirth 
deceased hometown alive age #> <chr> <chr> <chr> <dbl> <lgl> <chr> <lgl> <dbl> #> 1 John Lennon guitar 1940 TRUE Liverpool FALSE 83 #> 2 Paul McCartney bass 1942 FALSE Liverpool TRUE 81 #> 3 Ringo Starr drums 1940 FALSE Liverpool TRUE 83 #> 4 George Harrison guitar 1943 TRUE Liverpool FALSE 80 ``` --- ## Access elements by index: `DF[row, column]` General form for indexing elements: ```r DF[row, column] ``` -- .leftcol[ Select the element in row 1, column 2: ```r beatles[1, 2] ``` ``` #> # A tibble: 1 × 1 #> lastName #> <chr> #> 1 Lennon ``` ] -- .rightcol[ Select the elements in rows 1 & 2 and columns 2 & 3: ```r beatles[c(1, 2), c(2, 3)] ``` ``` #> # A tibble: 2 × 2 #> lastName instrument #> <chr> <chr> #> 1 Lennon guitar #> 2 McCartney bass ``` ] --- ## Leave row or column "blank" to select all -- ```r beatles[c(1, 2),] # Selects all COLUMNS for rows 1 & 2 ``` ``` #> # A tibble: 2 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE ``` -- ```r beatles[,c(1, 2)] # Selects all ROWS for columns 1 & 2 ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` --- ## Negative indices exclude row / column -- ```r beatles[-1, ] # Select all ROWS except the first ``` ``` #> # A tibble: 3 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE #> 2 Ringo Starr drums 1940 FALSE #> 3 George Harrison guitar 1943 TRUE ``` -- ```r beatles[,-1] # Select all COLUMNS except the first ``` ``` #> # A tibble: 4 × 4 #> lastName instrument yearOfBirth deceased #> <chr> <chr> <dbl> <lgl> #> 1 Lennon guitar 1940 TRUE #> 2 McCartney bass 1942 FALSE #> 3 Starr drums 1940 FALSE #> 4 Harrison guitar 1943 TRUE ``` --- # You can select columns by their names Note: you don't need the comma to select an entire column -- .leftcol[ 
One column ```r beatles['firstName'] ``` ``` #> # A tibble: 4 × 1 #> firstName #> <chr> #> 1 John #> 2 Paul #> 3 Ringo #> 4 George ``` ] -- <br>Multiple columns .rightcol[ ```r beatles[c('firstName', 'lastName')] ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` ] --- ## Use logical indices to _filter_ rows -- **Which Beatles members are still alive?**<br>Create a logical vector using the `deceased` column: ```r beatles$deceased == FALSE ``` ``` #> [1] FALSE TRUE TRUE FALSE ``` -- Insert this logical vector in the ROW position of `beatles[,]`: ```r beatles[beatles$deceased == FALSE,] ``` ``` #> # A tibble: 2 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE #> 2 Ringo Starr drums 1940 FALSE ``` --- class: inverse
## Your turn Answer these questions using the `beatles` data frame: 1. Create a new column, `playsGuitar`, which is `TRUE` if the band member plays the guitar and `FALSE` otherwise. 2. Filter the data frame to select only the rows for the band members who have four-letter first names. 3. Create a new column, `fullName`, which contains the band member's first and last name separated by a space (e.g. `"John Lennon"`) --- class: inverse, center # .fancy[Break]
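---

## Combining slicing tools: a sketch

The slicing tools from this section can be combined in a single call. The sketch below is one illustrative example, not part of the original exercises: it rebuilds the five-column `beatles` tibble, and the names `longName` and `result` are made up here. It uses `str_length()` from **stringr** to build a logical vector, then puts that vector in the row position and column names in the column position.

```r
library(dplyr)    # provides tibble()
library(stringr)  # provides str_length()

beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

# Logical vector: TRUE where the last name has more than 6 characters
longName <- str_length(beatles$lastName) > 6

# Logical vector in the ROW position, column names in the COLUMN position
result <- beatles[longName, c("firstName", "lastName")]
result
```

The same pattern works with any condition that returns one `TRUE` or `FALSE` per row.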
--- class: inverse, middle # Week 10: .fancy[Data Frames] ### 1. Basics ### 2. Slicing ### BREAK ### 3. .orange[External data] --- # Getting data into R <br> ## Options: ## 1. Load external packages ## 2. Read in external files (usually a `.csv`* file) <br> *csv = "comma-separated values" --- ## Data from an R package ```r library(ggplot2) ``` -- See which data frames are available in a package: ```r data(package = "ggplot2") ``` --- # Find out about package data sets with `?` ```r ?msleep ``` ``` msleep {ggplot2} An updated and expanded version of the mammals sleep dataset Description This is an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007. ``` --- # Previewing data frames: `msleep` -- Look at the data in a "spreadsheet"-like way: ```r View(msleep) ``` This is "read-only" so you can't corrupt the data 😄 --- # My favorite quick summary: `glimpse()` Preview each variable with `str()` or `glimpse()` ```r glimpse(msleep) ``` .code80[ ``` #> Rows: 83 #> Columns: 11 #> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-tailed shrew", "Cow", "Three-toed sloth", "Northern fur seal", "Vesper mouse", "Dog", "Roe deer", "Goat", "Guinea pig", "Grivet", "Chinchilla", "Star-nosed mole", "Afri… #> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bradypus", "Callorhinus", "Calomys", "Canis", "Capreolus", "Capri", "Cavis", "Cercopithecus", "Chinchilla", "Condylura", "Cricetomys", "Cryptotis", "Dasypus", "Dendrohyrax",… #> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carni", NA, "carni", "herbi", "herbi", "herbi", "omni", "herbi", "omni", "omni", "omni", "carni", "herbi", "omni", "herbi", "insecti", "herbi", "herbi", "omni", "omni", "herb… #> $ order <chr> "Carnivora", "Primates", 
"Rodentia", "Soricomorpha", "Artiodactyla", "Pilosa", "Carnivora", "Rodentia", "Carnivora", "Artiodactyla", "Artiodactyla", "Rodentia", "Primates", "Rodentia", "Soricomorpha", "Rodentia", "Soricomorpha"… #> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "domesticated", "lc", "lc", "domesticated", "lc", "domesticated", "lc", NA, "lc", "lc", "lc", "lc", "en", "lc", "domesticated", "domesticated", "lc", "lc", NA, "domesticated",… #> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5.3, 9.4, 10.0, 12.5, 10.3, 8.3, 9.1, 17.4, 5.3, 18.0, 3.9, 19.7, 2.9, 3.1, 10.1, 10.9, 14.9, 12.5, 9.8, 1.9, 2.7, 6.2, 6.3, 8.0, 9.5, 3.3, 19.4, 10.1, 14.2, 14.3, 12.8, 1… #> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, 0.7, 1.5, 2.2, 2.0, 1.4, 3.1, 0.5, 4.9, NA, 3.9, 0.6, 0.4, 3.5, 1.1, NA, 3.2, 1.1, 0.4, 0.1, 1.5, 0.6, 1.9, 0.9, NA, 6.6, 1.2, 1.9, 3.1, NA, 1.4, 2.0, NA, NA, 0.9, NA, 0.… #> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, NA, 0.3333333, NA, NA, 0.2166667, NA, 0.1166667, NA, NA, 0.1500000, 0.3833333, NA, 0.3333333, NA, 0.1166667, 1.0000000, NA, 0.2833333, NA, NA, 0.4166667, 0.5500000, NA, NA… #> $ awake <dbl> 11.90, 7.00, 9.60, 9.10, 20.00, 9.60, 15.30, 17.00, 13.90, 21.00, 18.70, 14.60, 14.00, 11.50, 13.70, 15.70, 14.90, 6.60, 18.70, 6.00, 20.10, 4.30, 21.10, 20.90, 13.90, 13.10, 9.10, 11.50, 14.20, 22.10, 21.35, 17.80, 17.70, 16.0… #> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.09820, 0.11500, 0.00550, NA, 0.00640, 0.00100, 0.00660, 0.00014, 0.01080, 0.01230, 0.00630, 4.60300, 0.00030, 0.65500, 0.41900, 0.00350, 0.11500, NA, 0.02560, 0.00500, N… #> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 14.000, 14.800, 33.500, 0.728, 4.750, 0.420, 0.060, 1.000, 0.005, 3.500, 2.950, 1.700, 2547.000, 0.023, 521.000, 187.000, 0.770, 10.000, 0.071, 3.300, 0.200, 899.995, … ``` ] --- ## Also very useful for quick checks: 
`head()` and `tail()` .leftcol[ View the **first** 6 rows with `head()` ```r head(msleep) ``` .code80[ ``` #> # A tibble: 6 × 11 #> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt #> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Cheetah Acinonyx carni Carnivora lc 12.1 NA NA 11.9 NA 50 #> 2 Owl monkey Aotus omni Primates <NA> 17 1.8 NA 7 0.0155 0.48 #> 3 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NA 9.6 NA 1.35 #> 4 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.133 9.1 0.00029 0.019 #> 5 Cow Bos herbi Artiodactyla domesticated 4 0.7 0.667 20 0.423 600 #> 6 Three-toed sloth Bradypus herbi Pilosa <NA> 14.4 2.2 0.767 9.6 NA 3.85 ``` ]] .rightcol[ View the **last** 6 rows with `tail()` ```r tail(msleep) ``` .code80[ ``` #> # A tibble: 6 × 11 #> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt #> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Tenrec Tenrec omni Afrosoricida <NA> 15.6 2.3 NA 8.4 0.0026 0.9 #> 2 Tree shrew Tupaia omni Scandentia <NA> 8.9 2.6 0.233 15.1 0.0025 0.104 #> 3 Bottle-nosed dolphin Tursiops carni Cetacea <NA> 5.2 NA NA 18.8 NA 173. #> 4 Genet Genetta carni Carnivora <NA> 6.3 1.3 NA 17.7 0.0175 2 #> 5 Arctic fox Vulpes carni Carnivora <NA> 12.5 NA NA 11.5 0.0445 3.38 #> 6 Red fox Vulpes carni Carnivora <NA> 9.8 2.4 0.35 14.2 0.0504 4.23 ``` ]] --- # Importing an external data file <br> .leftcol60[ Note the `data.csv` file in your `data` folder. - **DO NOT** double-click it! - **DO NOT** open it in Excel! Excel can **corrupt** your data! (Don't believe me? read [this](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008984)) ] -- .rightcol40[ If you **must** open it in Excel: - Make a copy - Open the copy ] --- # Steps to importing external data files -- ## 1. 
Create a path to the data ```r library(here) *pathToData <- here('data', 'data.csv') pathToData ``` ``` #> [1] "/Users/jhelvy/gh/teaching/P4A/2023-Spring/class/8-data-frames/data/data.csv" ``` -- ## 2. Import the data ```r library(readr) *df <- read_csv(pathToData) ``` --- ## Using the **here** package to make file paths With no arguments, the `here()` function returns the path to the **root** of your project, i.e. your _working directory_ <br>(this is where your `.Rproj` file lives!) ```r here() ``` ``` #> [1] "/Users/jhelvy/gh/teaching/P4A/2023-Spring/class/8-data-frames" ``` -- With arguments, the `here()` function builds the path to files _inside_ your working directory ```r path_to_data <- here('data', 'data.csv') path_to_data ``` ``` #> [1] "/Users/jhelvy/gh/teaching/P4A/2023-Spring/class/8-data-frames/data/data.csv" ``` --- # Avoid hard-coding file paths! ### (they can break on different computers) ```r path_to_data <- 'data/data.csv' path_to_data ``` ``` #> [1] "data/data.csv" ``` # 💩💩💩 --- class: center .leftcol40[.left[ ## Use the **here** package to make file paths ]] .rightcol60[ <center><br> <img src="images/horst_monsters_here.png"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) ] --- # Use `read_csv()`, not `read.csv()` .leftcol[ <center> <img src="images/read_csv.png" width=100%> </center> ] .rightcol[ ```r path_to_data <- here('data', 'data.csv') *data <- read_csv(path_to_data) ``` ] --- class: inverse
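## Checking the path before reading

The two import steps can be combined with a quick check that the file is where you expect. This is a hedged sketch, assuming a `data/data.csv` file inside the project folder as in the slides; the `file.exists()` check is an addition for illustration and not part of the original steps.

```r
library(here)   # provides here()
library(readr)  # provides read_csv()

# Build the path relative to the project root (where the .Rproj file lives)
path_to_data <- here('data', 'data.csv')

# Only read the file if it exists; otherwise print the path that was tried
if (file.exists(path_to_data)) {
    df <- read_csv(path_to_data)
    dim(df)  # number of rows and columns
} else {
    message("File not found: ", path_to_data)
}
```

If the message appears, compare the printed path with the output of `here()` to see which folder is being treated as the project root.

---
class: inverse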
## Your turn .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `df`. 2) Use the `df` object to answer the following questions: - How many rows and columns are in the data frame? - Preview the different columns. What do you think this data is about? What might one row represent? What type of data is each column? (you don't need to type this out...just inspect the data) - How many unique airports are in the data frame? - What are the earliest and latest observations in the data frame? - What are the lowest and highest costs of any one repair in the data frame? ] --- class: center ## Next week: better data wrangling with **dplyr** <center> <img src="images/horst_monsters_data_wrangling.png" width="600"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # Select rows with `filter()` Example: Filter rows to find which Beatles members are still alive. -- **Base R**: ```r beatles[beatles$deceased == FALSE,] ``` -- **dplyr**: ```r filter(beatles, deceased == FALSE) ``` --- # In 2 weeks: plotting with **ggplot2** .leftcol[ ## Translate _data_... ``` #> # A tibble: 11 × 2 #> brainwt bodywt #> <dbl> <dbl> #> 1 0.001 0.06 #> 2 0.0066 1 #> 3 0.00014 0.005 #> 4 0.0108 3.5 #> 5 0.0123 2.95 #> 6 0.0063 1.7 #> 7 4.60 2547 #> 8 0.0003 0.023 #> 9 0.655 521 #> 10 0.419 187 #> 11 0.0035 0.77 ``` ] .rightcol[ ## ...into _information_ <img src="figs/unnamed-chunk-76-1.png" width="468" /> ] --- # A note about HW 9 - You have what you need to start now. - It will be _much_ easier if you use the **dplyr** functions (i.e. read ahead). --- class: inverse
## Extra Practice! 1. Install the **dslabs** package. 2. Load the package, then use `data(package = "dslabs")` to see the different data sets in this package. 3. Pick one. 4. Answer these questions: - What is the data set about? - How many observations are in the data frame? - What is the original source of the data? - What type of data is each variable? - Find one interesting thing about it to share.
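---

## Exploring a package data set: a sketch

The questions above follow a workflow that can be sketched in a few lines. To avoid guessing at the contents of **dslabs**, this example uses `msleep` from **ggplot2** as a stand-in, since it was introduced earlier; swap in whichever data set you pick. The name `col_types` is made up for illustration.

```r
library(ggplot2)  # provides the msleep data frame
library(dplyr)    # provides glimpse()

# ?msleep  # in an interactive session, opens the help page (description and source)

# How many observations (rows) and variables (columns)?
dim(msleep)

# Preview every variable and its data type
glimpse(msleep)

# Data type of each variable as a named character vector
col_types <- sapply(msleep, class)
col_types

# Number of missing values in each column
colSums(is.na(msleep))
```

The help page usually answers "what is it about?" and "what is the source?", while `dim()` and `glimpse()` cover size and data types.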