class: middle, inverse

.leftcol30[

<center>
<img src="https://github.com/emse-p4a-gwu/emse-p4a-gwu.github.io/blob/master/images/logo.png?raw=true" width=250>
</center>

]

.rightcol70[

# Week 12: .fancy[Webscraping]

### EMSE 4571 / 6571: Intro to Programming for Analytics

### John Paul Helveston

### April 18, 2024

]

---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

#### Some disclaimers ([here](https://r4ds.hadley.nz/webscraping.html#scraping-ethics-and-legalities) for more details)

You're probably okay if the data is:

- Public
- Non-personal
- Factual

Otherwise, consult a lawyer and/or maybe don't scrape it.

#### Terms of service

Generally not upheld, unless you **need an account to access the data**.

#### Copyright

Data is not copyright protected (in the US), but creative works are. Be careful.

---

class: center, middle

## Another good resource:<br>https://www.zyte.com/learn/web-scraping-best-practices/

---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

## **H**yper**T**ext **M**arkup **L**anguage

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text & <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
```

HTML has a hierarchical structure formed by:

- Start and end **"tags"** (e.g. `<tag>` and `</tag>`)
- Optional attributes (e.g. `id='first'`)
- Contents (everything in between the start and end tag)

---

.leftcol[

## Common tags

- `<h1>` = Header level 1
- `<a>` = [Url]() link
- `<b>` = **Bold** text
- `<i>` = _Italic_ text
- `<p>` = Paragraph
- `<li>` = List item

]

.rightcol[

## Attributes

- `id`: Element identifier, e.g.<br>`<h1 id='first'>A heading</h1>`
- `class`: Styling class, e.g.<br>`<h1 class='header'>A heading</h1>`

]

---

class: middle

.leftcol40[

# Quick example

- Go [here](https://rvest.tidyverse.org/articles/starwars.html)
- Right-click, select<br>"View Page Source"

]

.rightcol60[

https://rvest.tidyverse.org/articles/starwars.html

<center>
<img src="images/view-source.png" width=100%>
</center>

]

---

## **Strategy**: Use tags and classes to parse html

.leftcol[

`source_code`

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
* <h1 id='first'>A heading</h1>
  <p>Some text & <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
```

]

--

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
* html_elements("h1")
```

```
#> {xml_nodeset (1)}
#> [1] <h1 id="first">A heading</h1>
```

]

---

## **Strategy**: Use tags and classes to parse html

.leftcol[

`source_code`

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
* <p>Some text & <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
```

]

--

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
* html_elements("p")
```

```
#> {xml_nodeset (1)}
#> [1] <p>Some text & <b>some bold text.</b></p>
```

]

---

## Dealing with multiple nodes (bullet list example)

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Rendered source code (in a browser)
- **C-3PO** is a _droid_ that weighs 167 kg
- **R4-P17** is a _droid_
- **R2-D2** is a _droid_ that weighs 96 kg
- **Yoda** weighs 66 kg
]

---

## Dealing with multiple nodes (bullet list example)

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

--

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
* html_elements("li")
```

```
#> {xml_nodeset (4)}
#> [1] <li>\n<b>C-3PO</b> is a <i>droid</i> that weighs <span class="weight">167 kg</span>\n</li>
#> [2] <li>\n<b>R4-P17</b> is a <i>droid</i>\n</li>
#> [3] <li>\n<b>R2-D2</b> is a <i>droid</i> that weighs <span class="weight">96 kg</span>\n</li>
#> [4] <li>\n<b>Yoda</b> weighs <span class="weight">66 kg</span>\n</li>
```

]

---

## Extract the names with `"b"`

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
  html_elements("li") %>%
* html_element("b")
```

```
#> {xml_nodeset (4)}
#> [1] <b>C-3PO</b>
#> [2] <b>R4-P17</b>
#> [3] <b>R2-D2</b>
#> [4] <b>Yoda</b>
```

]

---

## Extract the _text_ with `html_text2()`

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
  html_elements("li") %>%
  html_element("b") %>%
* html_text2()
```

```
#> [1] "C-3PO"  "R4-P17" "R2-D2"  "Yoda"
```

]

---

## Extract the weights using the `".weight"` class

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
  html_elements("li") %>%
* html_element(".weight") %>%
* html_text2()
```

```
#> [1] "167 kg" NA       "96 kg"  "66 kg"
```

]

---

## Putting it together in a data frame

.leftcol45[

```r
library(rvest)
library(tidyverse) # for tibble() and parse_number()

items <- read_html(source_code) %>%
  html_elements("li")
```

]

.rightcol55[

```r
data <- tibble(
  name = items %>%
    html_element("b") %>%
    html_text2(),
  weight = items %>%
    html_element(".weight") %>%
    html_text2() %>%
    parse_number()
)

data
```

```
#> # A tibble: 4 × 2
#>   name   weight
#>   <chr>   <dbl>
#> 1 C-3PO     167
#> 2 R4-P17     NA
#> 3 R2-D2      96
#> 4 Yoda       66
```

]

---

### `html_table()` is awesome (if the site uses an HTML table)

.leftcol[

Some pages have HTML tables in the source code, e.g.
https://www.ssa.gov/international/coc-docs/states.html

<center>
<img src="images/state-table.png" width=100%>
</center>

]

--

.rightcol[

```r
url <- "https://www.ssa.gov/international/coc-docs/states.html"

df <- read_html(url) %>%
* html_table()

df
```

```
#> [[1]]
#> # A tibble: 56 × 2
#>    X1                   X2
#>    <chr>                <chr>
#>  1 ALABAMA              AL
#>  2 ALASKA               AK
#>  3 AMERICAN SAMOA       AS
#>  4 ARIZONA              AZ
#>  5 ARKANSAS             AR
#>  6 CALIFORNIA           CA
#>  7 COLORADO             CO
#>  8 CONNECTICUT          CT
#>  9 DELAWARE             DE
#> 10 DISTRICT OF COLUMBIA DC
#> # ℹ 46 more rows
```

]

---

## Find elements with [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)

<center>
<img src="images/selectorgadget.png" width=100%>
</center>

---

## Find elements with "inspect"

<center>
<img src="images/p4a.png" width=1000>
</center>
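---

## Aside: three ways to select elements

Tags, classes, and ids are all CSS selectors, so `html_elements()` understands all three. A minimal sketch (the toy page below is made up for illustration):

```r
library(rvest)

html <- minimal_html("
  <h1 id='first'>A heading</h1>
  <p class='intro'>Some text</p>
")

html %>% html_elements("h1")     # by tag
html %>% html_elements(".intro") # by class (leading dot)
html %>% html_elements("#first") # by id (leading hash)
```

This is why `".weight"` (with the dot) matched `class='weight'` on the earlier slides.

---

class: inverse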
## Your turn

Scrape data on famous quotes from http://quotes.toscrape.com/

Your resulting data frame should have these fields:

- `quote`: The quote
- `author`: The author of the quote
- `about_url`: The url to the "about" page

```
#> Rows: 10
#> Columns: 3
#> $ quote     <chr> "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "“There are only two w…
#> $ author    <chr> "Albert Einstein", "J.K. Rowling", "Albert Einstein", "Jane Austen", "Marilyn Monroe", "Albert Einstein", "André Gide", "Thomas A. Edison", "Eleanor Roosevelt", "Steve Martin"
#> $ about_url <chr> "http://quotes.toscrape.com/author/Albert-Einstein", "http://quotes.toscrape.com/author/J-K-Rowling", "http://quotes.toscrape.com/author/Albert-Einstein", "http://quotes.toscrape.com/author/Jane-Austen", "http://quotes.toscrape.co…
```

---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

class: center, middle, inverse

# What if there is more than one page to scrape?

--

<br>

# .orange[Use a loop!]

---

# Iterative scraping!

<br>

## 1. Find the url pattern

## 2. Scrape one page

## 3. Iteratively scrape each page with `map_df()`

---

## 1. Find the url pattern

Example: http://quotes.toscrape.com/

url to page 2: http://quotes.toscrape.com/page/2

Pattern: `http://quotes.toscrape.com/page/` + `#`

--

<br>

I can _build_ the url to any page with `paste()`:

```r
root <- "http://quotes.toscrape.com/page/"
page <- 3
url <- paste(root, page, sep = "")
url
```

```
#> [1] "http://quotes.toscrape.com/page/3"
```

---

## 2. Scrape one page

.leftcol[

Build the url to a single page:

```r
root <- "http://quotes.toscrape.com/page/"
page <- 3
url <- paste(root, page, sep = "")
*url
```

```
#> [1] "http://quotes.toscrape.com/page/3"
```

]

.rightcol[

Scrape the data on that page:

```r
*quote_nodes <- read_html(url) %>%
  html_elements(".quote")

df <- tibble(
  quote = quote_nodes %>%
    html_element(".text") %>%
    html_text(),
  author = quote_nodes %>%
    html_element(".author") %>%
    html_text(),
  about_url = quote_nodes %>%
    html_element("a") %>%
    html_attr("href")
) %>%
  # The href is relative, so prepend the
  # site root (not the page url)
  mutate(about_url = paste0(
    "http://quotes.toscrape.com", about_url
  ))
```

]

---

## 3. Iteratively scrape each page with `map_df()`

.leftcol55[

Make a function to get data from a page:

.code70[

```r
get_page_data <- function(page) {
  root <- "http://quotes.toscrape.com/page/"
  url <- paste(root, page, sep = "")
  quote_nodes <- read_html(url) %>%
    html_elements(".quote")
  df <- tibble(
    quote = quote_nodes %>%
      html_element(".text") %>%
      html_text(),
    author = quote_nodes %>%
      html_element(".author") %>%
      html_text(),
    about_url = quote_nodes %>%
      html_element("a") %>%
      html_attr("href")
  ) %>%
    mutate(about_url = paste0(
      "http://quotes.toscrape.com", about_url
    ))
  return(df)
}
```

]]

--

.rightcol45[

Iterate with `map_df()`:

.code70[

```r
pages <- 1:10
df <- map_df(pages, \(x) get_page_data(x))
```

]]
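---

## Scrape gently

Loops make it easy to hammer a server with rapid-fire requests. One easy habit (a sketch, not part of the exercise): pause between pages. Here `polite_get_page_data()` is a hypothetical wrapper around `get_page_data()` from the last slide, and the one-second pause is an arbitrary choice:

```r
polite_get_page_data <- function(page) {
  Sys.sleep(1) # wait ~1 second before each request
  get_page_data(page)
}

pages <- 1:10
df <- map_df(pages, \(x) polite_get_page_data(x))
```

---

class: inverse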
## Your turn

Template code is provided to scrape data on F1 drivers for the 2022 season from https://www.formula1.com/en/results.html/2022/drivers.html

Your job is to extend it to scrape the data from seasons 2010 to 2024.

Your final dataset should look like this:

```
#> # A tibble: 6 × 8
#>    year position first   last       abb   nationality team                 points
#>   <dbl>    <int> <chr>   <chr>      <chr> <chr>       <chr>                 <int>
#> 1  2022        1 Max     Verstappen VER   NED         Red Bull Racing RBPT    454
#> 2  2022        2 Charles Leclerc    LEC   MON         Ferrari                 308
#> 3  2022        3 Sergio  Perez      PER   MEX         Red Bull Racing RBPT    305
#> 4  2022        4 George  Russell    RUS   GBR         Mercedes                275
#> 5  2022        5 Carlos  Sainz      SAI   ESP         Ferrari                 246
#> 6  2022        6 Lewis   Hamilton   HAM   GBR         Mercedes                240
```

---

class: inverse, center

# .fancy[Intermission]
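---

## One more scraping habit: check `robots.txt`

Sites publish their scraping rules at `/robots.txt`. One way to check a path before scraping it, assuming the `{robotstxt}` package is installed (it is not used elsewhere in this lecture):

```r
library(robotstxt)

# TRUE if the site's robots.txt allows fetching this path
paths_allowed("http://quotes.toscrape.com/page/1")
```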
---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

class: center, middle, inverse

# Hopefully you won't need to scrape

---

# Before you start scraping, ask...

<br>

## 1. Is there a formatted dataset I can download?<br>(e.g. see [this page](https://eda.seas.gwu.edu/2023-Fall/finding-data.html))

--

## 2. Is there an API I can use?

---

class: middle

# .center[Application Programming Interface (API)]

<br>

> A set of defined rules that enable different applications to communicate (and pass data) with each other

--

<br>

#### .center[Basically, APIs make it easier to get data from the web]

---

# APIs use the `url` to "ask" a website for data

### **Example**: Stock market prices from https://www.alphavantage.co/

--

API Request:<br>https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={symbol}&apikey={api_key}&datatype=csv

- `function`: The time series of your choice
- `symbol`: Stock price symbol (e.g. `NFLX` = Netflix)
- `apikey`: Your API key (you have to register to get one)
- `datatype`: `csv` or `json`

---

## Setting up your API key

### 1. Register for a key here: https://www.alphavantage.co/support/#api-key

### 2. Open your `.Renviron` file:

```r
usethis::edit_r_environ()
```

### 3. Store your key:

```
ALPHAVANTAGE_API_KEY=ka8hi45qi2j5ja657fj963
```

### 4. Retrieve your key:

```r
api_key <- Sys.getenv("ALPHAVANTAGE_API_KEY")
```

---

## Using your key to get data

.leftcol55[

```r
api_key <- Sys.getenv("ALPHAVANTAGE_API_KEY")
symbol <- "NFLX" # Netflix

# Build the url data request
url <- paste0(
  "https://www.alphavantage.co/query",
  "?function=TIME_SERIES_DAILY",
  "&symbol=", symbol,
  "&apikey=", api_key,
  "&datatype=csv"
)

# Read in the data
df <- readr::read_csv(url)
```

]

--

.rightcol45[

```r
glimpse(df)
```

```
#> Rows: 100
#> Columns: 6
#> $ timestamp <date> 2024-04-18, 2024-04-17, 2024-04-16, 2024-04-15, 2024-04-12, 2024-04-11, 2024-04-10, 2024-04-09, 2024-04-08, 2024-04-05, 2024-04-04, 2024-04-03, 2024-04-02, 2024-04-01, 2024-03-28, 2024-03-27, 2024-03-26, 2024-03-25, 2024-03-22, 2…
#> $ open      <dbl> 612.100, 620.970, 607.500, 630.170, 628.230, 624.420, 610.970, 631.990, 636.390, 624.920, 633.210, 612.745, 611.000, 608.000, 614.990, 629.010, 625.200, 627.900, 624.160, 630.650, 619.950, 615.620, 613.560, 622.920, 615.000, 613.3…
#> $ high      <dbl> 621.3300, 620.9700, 622.4500, 630.1700, 633.1199, 631.6600, 620.1400, 631.9900, 639.0000, 637.9100, 638.0000, 630.4100, 615.0300, 615.1100, 615.0000, 631.3500, 634.3899, 630.4600, 629.0500, 634.3617, 629.5050, 621.2800, 627.4100, …
#> $ low       <dbl> 605.4350, 607.7100, 607.5000, 603.8710, 618.9150, 617.2400, 609.3400, 615.6347, 628.1100, 622.7100, 616.5800, 611.5000, 605.5101, 605.5710, 601.5900, 610.7300, 619.1836, 623.1600, 621.0000, 622.3300, 618.3400, 608.0000, 610.4481, …
#> $ close     <dbl> 610.56, 613.69, 617.52, 607.15, 622.83, 628.78, 618.58, 618.20, 628.41, 636.18, 617.14, 630.08, 614.21, 614.31, 607.33, 613.53, 629.24, 627.46, 628.01, 622.71, 627.69, 620.74, 618.39, 605.88, 613.01, 609.45, 611.08, 600.93, 604.82…
#> $ volume    <dbl> 8468407, 3312222, 3519122, 3085394, 2959269, 2662662, 2806248, 2135639, 2129483, 3327195, 3008557, 2913989, 2018210, 2063845, 3708803, 2628267, 2804453, 1803264, 2135688, 2507671, 2639509, 2142613, 3344244, 6671629, 3110468, 21782…
```

]

---

## Using your key to get data

.leftcol55[

```r
df %>%
  ggplot() +
  geom_line(
    aes(
      x = timestamp,
      y = close
    )
  ) +
  theme_bw() +
  labs(
    x = "Date",
    y = "Closing Price ($USD)",
    title = paste0("Stock Prices: ", symbol)
  )
```

]

.rightcol45[

<img src="figs/unnamed-chunk-48-1.png" width="504" />

]

---

class: center, middle

# Want something else?

# Read the docs!

## https://www.alphavantage.co/documentation/
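---

## Wrap the request in a function

The same request-building logic as the previous slides, wrapped in a function so you can fetch any symbol. A sketch: `get_daily_prices()` is a hypothetical helper, not part of the Alpha Vantage API:

```r
library(tidyverse)

get_daily_prices <- function(symbol) {
  api_key <- Sys.getenv("ALPHAVANTAGE_API_KEY")
  url <- paste0(
    "https://www.alphavantage.co/query",
    "?function=TIME_SERIES_DAILY",
    "&symbol=", symbol,
    "&apikey=", api_key,
    "&datatype=csv"
  )
  read_csv(url) %>%
    mutate(symbol = symbol) # tag each row with its symbol
}

# Fetch several stocks with one map_df() call
df <- map_df(c("NFLX", "AAPL"), \(s) get_daily_prices(s))
```

---

class: inverse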
### Your turn: COVID case data from https://covidactnow.org/

1. Register for a key here: https://apidocs.covidactnow.org/
2. Edit your .Renviron:<br>`usethis::edit_r_environ()`
3. Store your key as `COVID_ACT_NOW_KEY`
4. Load your API key:<br>`api_key <- Sys.getenv("COVID_ACT_NOW_KEY")`
5. Build the url to request historical state-level data
6. Read in the data, then make this figure of daily COVID-19 cases in DC

<img src="figs/covid_dc-1.png" width="864" />

---

# [HW12](https://p4a.seas.gwu.edu/2024-Spring/hw/12-webscraping.html)