class: middle, inverse

.leftcol30[

<center>
<img src="https://github.com/emse-p4a-gwu/emse-p4a-gwu.github.io/blob/master/images/logo.png?raw=true" width=250>
</center>

]

.rightcol70[

# Week 12: .fancy[Webscraping]

### EMSE 4571 / 6571: Intro to Programming for Analytics

### John Paul Helveston

### April 18, 2024

]

---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

#### Some disclaimers ([here](https://r4ds.hadley.nz/webscraping.html#scraping-ethics-and-legalities) for more details)

You're probably okay if the data is:

- Public
- Non-personal
- Factual

Otherwise, consult a lawyer and/or maybe don't scrape it.

#### Terms of service

Generally not upheld, unless you **need an account to access the data**.

#### Copyright

Data is not copyright protected (in the US), but creative works are. Be careful.

---

class: center, middle

## Another good resource:<br>https://www.zyte.com/learn/web-scraping-best-practices/

---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

## **H**yper**T**ext **M**arkup **L**anguage

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text & <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
```

HTML has a hierarchical structure formed by:

- Start and end **"tags"** (e.g. `<tag>` and `</tag>`)
- Optional attributes (e.g. `id='first'`)
- Contents (everything in between the start and end tag)

---

.leftcol[

## Common tags

- `<h1>` = Header level 1
- `<a>` = [Url]() link
- `<b>` = **Bold** text
- `<i>` = _Italic_ text
- `<p>` = Paragraph
- `<li>` = List item

]

.rightcol[

## Attributes

- `id`: Element identifier, e.g.<br>`<h1 id='first'>A heading</h1>`
- `class`: Styling class, e.g.<br>`<h1 class='header'>A heading</h1>`

]

---

class: middle

.leftcol40[

# Quick example

- Go [here](https://rvest.tidyverse.org/articles/starwars.html)
- Right-click, select<br>"View Page Source"

]

.rightcol60[

https://rvest.tidyverse.org/articles/starwars.html

<center>
<img src="images/view-source.png" width=100%>
</center>

]

---

## **Strategy**: Use tags and classes to parse html

.leftcol[

`source_code`

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
* <h1 id='first'>A heading</h1>
  <p>Some text & <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
```

]

--

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
* html_elements("h1")
```

```
#> {xml_nodeset (1)}
#> [1] <h1 id="first">A heading</h1>
```

]

---

## **Strategy**: Use tags and classes to parse html

.leftcol[

`source_code`

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
* <p>Some text & <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
```

]

--

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
* html_elements("p")
```

```
#> {xml_nodeset (1)}
#> [1] <p>Some text & <b>some bold text.</b></p>
```

]

---

## Dealing with multiple nodes (bullet list example)

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Rendered source code (in a browser)
- **C-3PO** is a _droid_ that weighs 167 kg
- **R4-P17** is a _droid_
- **R2-D2** is a _droid_ that weighs 96 kg
- **Yoda** weighs 66 kg
]

---

## Dealing with multiple nodes (bullet list example)

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

--

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
* html_elements("li")
```

```
#> {xml_nodeset (4)}
#> [1] <li>\n<b>C-3PO</b> is a <i>droid</i> that weighs <span class="weight">167 kg</span>\n</li>
#> [2] <li>\n<b>R4-P17</b> is a <i>droid</i>\n</li>
#> [3] <li>\n<b>R2-D2</b> is a <i>droid</i> that weighs <span class="weight">96 kg</span>\n</li>
#> [4] <li>\n<b>Yoda</b> weighs <span class="weight">66 kg</span>\n</li>
```

]

---

## Extract the names with `"b"`

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
  html_elements("li") %>%
* html_element("b")
```

```
#> {xml_nodeset (4)}
#> [1] <b>C-3PO</b>
#> [2] <b>R4-P17</b>
#> [3] <b>R2-D2</b>
#> [4] <b>Yoda</b>
```

]

---

## Extract the _text_ with `html_text2()`

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
  html_elements("li") %>%
  html_element("b") %>%
* html_text2()
```

```
#> [1] "C-3PO"  "R4-P17" "R2-D2"  "Yoda"
```

]

---

## Extract the weights using the `".weight"` class

.leftcol[

`source_code`

```html
<ul>
  <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
  <li><b>R4-P17</b> is a <i>droid</i></li>
  <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
  <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
```

]

.rightcol[

Use the `{rvest}` package to parse html

```r
library(rvest)

html <- read_html(source_code)
html %>%
  html_elements("li") %>%
* html_element(".weight") %>%
* html_text2()
```

```
#> [1] "167 kg" NA       "96 kg"  "66 kg"
```

]

---

## Putting it together in a data frame

.leftcol45[

```r
library(rvest)
library(tidyverse) # for tibble() and parse_number()

items <- read_html(source_code) %>%
  html_elements("li")
```

]

.rightcol55[

```r
data <- tibble(
  name = items %>%
    html_element("b") %>%
    html_text2(),
  weight = items %>%
    html_element(".weight") %>%
    html_text2() %>%
    parse_number()
)

data
```

```
#> # A tibble: 4 × 2
#>   name   weight
#>   <chr>   <dbl>
#> 1 C-3PO     167
#> 2 R4-P17     NA
#> 3 R2-D2      96
#> 4 Yoda       66
```

]

---

### `html_table()` is awesome (if the site uses an HTML table)

.leftcol[

Some pages have HTML tables in the source code, e.g.
https://www.ssa.gov/international/coc-docs/states.html

<center>
<img src="images/state-table.png" width=100%>
</center>

]

--

.rightcol[

```r
url <- "https://www.ssa.gov/international/coc-docs/states.html"

df <- read_html(url) %>%
* html_table()

df
```

```
#> [[1]]
#> # A tibble: 56 × 2
#>    X1                   X2
#>    <chr>                <chr>
#>  1 ALABAMA              AL
#>  2 ALASKA               AK
#>  3 AMERICAN SAMOA       AS
#>  4 ARIZONA              AZ
#>  5 ARKANSAS             AR
#>  6 CALIFORNIA           CA
#>  7 COLORADO             CO
#>  8 CONNECTICUT          CT
#>  9 DELAWARE             DE
#> 10 DISTRICT OF COLUMBIA DC
#> # ℹ 46 more rows
```

]

---

## Find elements with [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)

<center>
<img src="images/selectorgadget.png" width=100%>
</center>

---

## Find elements with "inspect"

<center>
<img src="images/p4a.png" width=1000>
</center>
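---

## Aside: three ways to select elements

Tags, classes, and ids are all CSS selectors, so `html_elements()` understands all three. A minimal sketch (the toy page below is made up for illustration):

```r
library(rvest)

html <- minimal_html("
  <h1 id='first'>A heading</h1>
  <p class='intro'>Some text</p>
")

html %>% html_elements("h1")     # by tag
html %>% html_elements(".intro") # by class (leading dot)
html %>% html_elements("#first") # by id (leading hash)
```

This is why `".weight"` (with the dot) matched `class='weight'` on the earlier slides.

---

class: inverse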
## Your turn

Scrape data on famous quotes from http://quotes.toscrape.com/

Your resulting data frame should have these fields:

- `quote`: The quote
- `author`: The author of the quote
- `about_url`: The url to the "about" page

```
#> Rows: 10
#> Columns: 3
#> $ quote     <chr> "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "“There are only two w…
#> $ author    <chr> "Albert Einstein", "J.K. Rowling", "Albert Einstein", "Jane Austen", "Marilyn Monroe", "Albert Einstein", "André Gide", "Thomas A. Edison", "Eleanor Roosevelt", "Steve Martin"
#> $ about_url <chr> "http://quotes.toscrape.com/author/Albert-Einstein", "http://quotes.toscrape.com/author/J-K-Rowling", "http://quotes.toscrape.com/author/Albert-Einstein", "http://quotes.toscrape.com/author/Jane-Austen", "http://quotes.toscrape.co…
```

---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

class: center, middle, inverse

# What if there is more than one page to scrape?

--

<br>

# .orange[Use a loop!]

---

# Iterative scraping!

<br>

## 1. Find the url pattern

## 2. Scrape one page

## 3. Iteratively scrape each page with `map_df()`

---

## 1. Find the url pattern

Example: http://quotes.toscrape.com/

url to page 2: http://quotes.toscrape.com/page/2

Pattern: `http://quotes.toscrape.com/page/` + `#`

--

<br>

I can _build_ the url to any page with `paste()`:

```r
root <- "http://quotes.toscrape.com/page/"
page <- 3
url <- paste(root, page, sep = "")
url
```

```
#> [1] "http://quotes.toscrape.com/page/3"
```

---

## 2. Scrape one page

.leftcol[

Build the url to a single page:

```r
root <- "http://quotes.toscrape.com/page/"
page <- 3
url <- paste(root, page, sep = "")
*url
```

```
#> [1] "http://quotes.toscrape.com/page/3"
```

]

.rightcol[

Scrape the data on that page:

```r
*quote_nodes <- read_html(url) %>%
  html_elements(".quote")

df <- tibble(
  quote = quote_nodes %>%
    html_element(".text") %>%
    html_text(),
  author = quote_nodes %>%
    html_element(".author") %>%
    html_text(),
  about_url = quote_nodes %>%
    html_element("a") %>%
    html_attr("href")
) %>%
  # The href is relative, so prepend the
  # site root (not the page url)
  mutate(about_url = paste0(
    "http://quotes.toscrape.com", about_url
  ))
```

]

---

## 3. Iteratively scrape each page with `map_df()`

.leftcol55[

Make a function to get data from a page:

.code70[

```r
get_page_data <- function(page) {
  root <- "http://quotes.toscrape.com/page/"
  url <- paste(root, page, sep = "")
  quote_nodes <- read_html(url) %>%
    html_elements(".quote")
  df <- tibble(
    quote = quote_nodes %>%
      html_element(".text") %>%
      html_text(),
    author = quote_nodes %>%
      html_element(".author") %>%
      html_text(),
    about_url = quote_nodes %>%
      html_element("a") %>%
      html_attr("href")
  ) %>%
    mutate(about_url = paste0(
      "http://quotes.toscrape.com", about_url
    ))
  return(df)
}
```

]]

--

.rightcol45[

Iterate with `map_df()`:

.code70[

```r
pages <- 1:10
df <- map_df(pages, \(x) get_page_data(x))
```

]]
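---

## Scrape gently

Loops make it easy to hammer a server with rapid-fire requests. One easy habit (a sketch, not part of the exercise): pause between pages. Here `polite_get_page_data()` is a hypothetical wrapper around `get_page_data()` from the last slide, and the one-second pause is an arbitrary choice:

```r
polite_get_page_data <- function(page) {
  Sys.sleep(1) # wait ~1 second before each request
  get_page_data(page)
}

pages <- 1:10
df <- map_df(pages, \(x) polite_get_page_data(x))
```

---

class: inverse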
## Your turn

Template code is provided to scrape data on F1 drivers for the 2022 season from https://www.formula1.com/en/results.html/2022/drivers.html

Your job is to extend it to scrape the data from seasons 2010 to 2024.

Your final dataset should look like this:

```
#> # A tibble: 6 × 8
#>    year position first   last       abb   nationality team                 points
#>   <dbl>    <int> <chr>   <chr>      <chr> <chr>       <chr>                 <int>
#> 1  2022        1 Max     Verstappen VER   NED         Red Bull Racing RBPT    454
#> 2  2022        2 Charles Leclerc    LEC   MON         Ferrari                 308
#> 3  2022        3 Sergio  Perez      PER   MEX         Red Bull Racing RBPT    305
#> 4  2022        4 George  Russell    RUS   GBR         Mercedes                275
#> 5  2022        5 Carlos  Sainz      SAI   ESP         Ferrari                 246
#> 6  2022        6 Lewis   Hamilton   HAM   GBR         Mercedes                240
```

---

class: inverse, center

# .fancy[Intermission]
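---

## One more scraping habit: check `robots.txt`

Sites publish their scraping rules at `/robots.txt`. One way to check a path before scraping it, assuming the `{robotstxt}` package is installed (it is not used elsewhere in this lecture):

```r
library(robotstxt)

# TRUE if the site's robots.txt allows fetching this path
paths_allowed("http://quotes.toscrape.com/page/1")
```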
---

class: inverse, middle

# Week 12: .fancy[Webscraping]

### 1. HTML basics
### 2. Scraping static pages
### 3. Scraping multiple pages
### BREAK
### 4. Using APIs

---

class: center, middle, inverse

# Hopefully you won't need to scrape

---

# Before you start scraping, ask...

<br>

## 1. Is there a formatted dataset I can download?<br>(e.g. see [this page](https://eda.seas.gwu.edu/2023-Fall/finding-data.html))

--

## 2. Is there an API I can use?

---

class: middle

# .center[Application Programming Interface (API)]

<br>

> A set of defined rules that enable different applications to communicate (and pass data) with each other

--

<br>

#### .center[Basically, APIs make it easier to get data from the web]

---

# APIs use the `url` to "ask" a website for data

### **Example**: Stock market prices from https://www.alphavantage.co/

--

API Request:<br>https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={symbol}&apikey={api_key}&datatype=csv

- `function`: The time series of your choice
- `symbol`: Stock price symbol (e.g. `NFLX` = Netflix)
- `apikey`: Your API key (you have to register to get one)
- `datatype`: `csv` or `json`

---

## Setting up your API key

### 1. Register for a key here: https://www.alphavantage.co/support/#api-key

### 2. Open your `.Renviron` file:

```r
usethis::edit_r_environ()
```

### 3. Store your key:

```
ALPHAVANTAGE_API_KEY=ka8hi45qi2j5ja657fj963
```

### 4. Retrieve your key:

```r
api_key <- Sys.getenv("ALPHAVANTAGE_API_KEY")
```

---

## Using your key to get data

.leftcol55[

```r
api_key <- Sys.getenv("ALPHAVANTAGE_API_KEY")
symbol <- "NFLX" # Netflix

# Build the url data request
url <- paste0(
  "https://www.alphavantage.co/query",
  "?function=TIME_SERIES_DAILY",
  "&symbol=", symbol,
  "&apikey=", api_key,
  "&datatype=csv"
)

# Read in the data
df <- readr::read_csv(url)
```

]

--

.rightcol45[

```r
glimpse(df)
```

```
#> Rows: 100
#> Columns: 6
#> $ timestamp <date> 2024-04-18, 2024-04-17, 2024-04-16, 2024-04-15, 2024-04-12, 2024-04-11, 2024-04-10, 2024-04-09, 2024-04-08, 2024-04-05, 2024-04-04, 2024-04-03, 2024-04-02, 2024-04-01, 2024-03-28, 2024-03-27, 2024-03-26, 2024-03-25, 2024-03-22, 2…
#> $ open      <dbl> 612.100, 620.970, 607.500, 630.170, 628.230, 624.420, 610.970, 631.990, 636.390, 624.920, 633.210, 612.745, 611.000, 608.000, 614.990, 629.010, 625.200, 627.900, 624.160, 630.650, 619.950, 615.620, 613.560, 622.920, 615.000, 613.3…
#> $ high      <dbl> 621.3300, 620.9700, 622.4500, 630.1700, 633.1199, 631.6600, 620.1400, 631.9900, 639.0000, 637.9100, 638.0000, 630.4100, 615.0300, 615.1100, 615.0000, 631.3500, 634.3899, 630.4600, 629.0500, 634.3617, 629.5050, 621.2800, 627.4100, …
#> $ low       <dbl> 605.4350, 607.7100, 607.5000, 603.8710, 618.9150, 617.2400, 609.3400, 615.6347, 628.1100, 622.7100, 616.5800, 611.5000, 605.5101, 605.5710, 601.5900, 610.7300, 619.1836, 623.1600, 621.0000, 622.3300, 618.3400, 608.0000, 610.4481, …
#> $ close     <dbl> 610.56, 613.69, 617.52, 607.15, 622.83, 628.78, 618.58, 618.20, 628.41, 636.18, 617.14, 630.08, 614.21, 614.31, 607.33, 613.53, 629.24, 627.46, 628.01, 622.71, 627.69, 620.74, 618.39, 605.88, 613.01, 609.45, 611.08, 600.93, 604.82…
#> $ volume    <dbl> 8468407, 3312222, 3519122, 3085394, 2959269, 2662662, 2806248, 2135639, 2129483, 3327195, 3008557, 2913989, 2018210, 2063845, 3708803, 2628267, 2804453, 1803264, 2135688, 2507671, 2639509, 2142613, 3344244, 6671629, 3110468, 21782…
```

]

---

## Using your key to get data

.leftcol55[

```r
df %>%
  ggplot() +
  geom_line(
    aes(
      x = timestamp,
      y = close
    )
  ) +
  theme_bw() +
  labs(
    x = "Date",
    y = "Closing Price ($USD)",
    title = paste0("Stock Prices: ", symbol)
  )
```

]

.rightcol45[

<img src="figs/unnamed-chunk-48-1.png" width="504" />

]

---

class: center, middle

# Want something else?

# Read the docs!

## https://www.alphavantage.co/documentation/
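---

## Wrap the request in a function

The same request-building logic as the previous slides, wrapped in a function so you can fetch any symbol. A sketch: `get_daily_prices()` is a hypothetical helper, not part of the Alpha Vantage API:

```r
library(tidyverse)

get_daily_prices <- function(symbol) {
  api_key <- Sys.getenv("ALPHAVANTAGE_API_KEY")
  url <- paste0(
    "https://www.alphavantage.co/query",
    "?function=TIME_SERIES_DAILY",
    "&symbol=", symbol,
    "&apikey=", api_key,
    "&datatype=csv"
  )
  read_csv(url) %>%
    mutate(symbol = symbol) # tag each row with its symbol
}

# Fetch several stocks with one map_df() call
df <- map_df(c("NFLX", "AAPL"), \(s) get_daily_prices(s))
```

---

class: inverse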
### Your turn: COVID case data from https://covidactnow.org/

1. Register for a key here: https://apidocs.covidactnow.org/
2. Edit your .Renviron:<br>`usethis::edit_r_environ()`
3. Store your key as `COVID_ACT_NOW_KEY`
4. Load your API key:<br>`api_key <- Sys.getenv("COVID_ACT_NOW_KEY")`
5. Build the url to request historical state-level data
6. Read in the data, then make this figure of daily COVID-19 cases in DC

<img src="figs/covid_dc-1.png" width="864" />

---

# [HW12](https://p4a.seas.gwu.edu/2024-Spring/hw/12-webscraping.html)