18 Collecting web data
In this chapter, we introduce two common ways to collect data from the web in R. One approach uses R packages that connect to online data services and return results directly in R. The other approach uses web scraping tools to extract information from web pages when no suitable service interface is available.
In practice, these two approaches solve different problems. If a package already provides access to an online source, it is usually the easiest and most reliable option. When no such interface exists, scraping may still allow us to collect information from a page, provided that we respect the site’s terms of use, technical limits, and applicable law.
18.1 CRAN Task View: Web Technologies and Services
A useful starting point is the CRAN Task View on Web Technologies and Services. It summarizes packages and strategies for efficiently interacting with resources over the internet with R.
Several categories from that Task View are especially relevant for data collection. First, tools for HTTP requests are helpful when we need to send requests to a site or service.
Second, parsers for HTML, XML, and JSON are essential after data have been downloaded, because web data are often stored in one of these formats.
In addition, packages that connect to online services can be valuable. Depending on the source, such packages may provide access to financial data, publication data, social media data, or analytics data. Their main advantage is that they hide much of the low-level request and response handling from the user.
18.2 R API wrappers
A common way to access data on the web is through the APIs provided by web services. API stands for Application Programming Interface: a set of rules that allows two pieces of software to interact with each other.
When we work with an API to retrieve data from an online service, three ideas are especially important: access, request, and response.
Access concerns who is allowed to use the service. Some services are open, while others require authentication, registration, or paid access.
Request refers to the information we send to the service, such as a ticker symbol, date range, or keyword.
Response is the data returned by the service, often in a structured format such as JSON, XML, or CSV.
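These three ideas can be illustrated with a minimal request/response cycle. The sketch below uses the httr and jsonlite packages; the endpoint URL and query parameters are placeholders for illustration, not a real service.

```r
library(httr)
library(jsonlite)

# Request: send query parameters to a (hypothetical) service endpoint.
resp <- GET("https://api.example.com/quotes",
            query = list(symbol = "AAPL", from = "2021-01-01"))

# Access: stop with an informative error if the request was rejected,
# for example because authentication is required.
stop_for_status(resp)

# Response: parse the JSON body into an R list or data frame.
data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```

An API wrapper package performs essentially these steps internally and returns the parsed result directly.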
Many services also impose usage limits. For example, they may restrict how many requests can be made in a given period. These limits are put in place to ensure that the service provider’s servers are not overloaded and can operate under predictable loads. When a limit is reached, additional requests may be delayed, blocked, or charged at a higher rate.
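One simple way to stay within such limits is to pause between successive requests. The loop below is only a sketch; the one-second delay is an assumption for illustration, not a documented limit of any particular service.

```r
tickers <- c("AFL", "AAPL", "MMM")
results <- list()

for (tk in tickers) {
  # results[[tk]] <- ...send the request for symbol tk here...
  Sys.sleep(1)  # pause one second between requests to respect usage limits
}
```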
An API wrapper is a package that makes an API easier to use from a specific programming language. In R, an API wrapper typically converts several technical steps into a small number of user-friendly functions. This lets us focus on the data we want instead of the details of constructing requests and parsing responses.
Not every web service offers a well-documented public API. In those cases, we may need to collect information from web pages manually. Before doing so, we should always check the site’s policies and proceed responsibly.
18.3 Yahoo Finance data
Several R packages can retrieve Yahoo Finance data, including quantmod, tidyquant, and yfR.
Two widely used options are quantmod and tidyquant.
quantmod
With quantmod, the main function is getSymbols(). The first argument specifies the symbols. getSymbols() loads one object per symbol into an environment.
library(quantmod)
tickers <- c("AFL", "AAPL", "MMM")
stock_env <- new.env()
getSymbols(tickers, from = "2021-01-01", to = "2021-01-31", env = stock_env)
In this example, each downloaded series is stored in stock_env as an xts object. Creating a separate environment is often convenient because it keeps the global workspace clean; otherwise, all the downloaded objects would be placed in our current workspace.
If we want to work with the downloaded objects as a collection, we can convert the environment to a named list. Here we use eapply() to apply the function cbind() on those objects and bind them by columns into a wide format.
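A minimal sketch of this step, assuming stock_env was filled by the getSymbols() call above:

```r
# Apply cbind() to each object in the environment, producing a named list
# of xts objects (one entry per symbol).
stock_list <- eapply(stock_env, cbind)

# Merge the list column-wise into a single wide-format table.
wide <- do.call(cbind, stock_list)
```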
tidyquant
The package tidyquant provides a more tidy-data-oriented interface. Its function tq_get() retrieves financial data; the argument get = "stock.prices" tells tq_get() which type of data to retrieve.
library(tidyquant)
tickers <- c("AFL", "AAPL", "MMM", "META", "AMZN")
tq <- tq_get(tickers, get = "stock.prices", from = "2021-01-01", to = "2021-01-31")
tq_get() returns the result as a tibble, typically in long format. This format is often easier to use with packages from the tidyverse because each row corresponds to one observation for one symbol on one date.
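To see why the long format fits a tidy workflow, here is a small sketch that summarizes the downloaded tibble with dplyr; it assumes the tq object created above.

```r
library(dplyr)

# Average closing price per symbol over the downloaded period.
tq %>%
  group_by(symbol) %>%
  summarise(avg_close = mean(close))
```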
In short, quantmod is convenient when we want time series objects and direct access to financial modeling tools, while tidyquant is especially convenient when we want a tidy tibble that fits naturally into a data wrangling workflow.
18.4 Web scraping
When a site does not provide a convenient data interface, we may need to scrape information from its HTML pages. In this context, scraping means programmatically collecting information intended for human readers and converting it into a form suitable for analysis.
A common example is the table of S&P 500 constituents on Wikipedia. To collect the S&P 500 company data, we use the package rvest, which is part of the tidyverse. rvest makes it easier to scrape data from web pages.
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tickers <- url %>%
read_html() %>%
html_element("#constituents") %>%
html_table()
The function read_html() reads the page and creates an R representation of the HTML document. The function html_element() then selects the table node we want, and html_table() parses that table into a tibble.
When scraping a page, the main task is to identify the correct HTML element. Browser developer tools can help us inspect the page structure and locate ids, classes, or other attributes that uniquely identify the content we want. Once the relevant node is selected, rvest can extract tables, links, text, and other page elements.
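As a sketch of these other extractors, the snippet below pulls the page title and all link targets from the same Wikipedia page; the selectors used here are generic HTML element names, not site-specific ids.

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")

# Page title, taken from the <h1> element.
title <- page %>% html_element("h1") %>% html_text2()

# All link targets on the page, taken from the href attribute of <a> elements.
links <- page %>% html_elements("a") %>% html_attr("href")
```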
Web scraping is powerful, but it is also more fragile than using a package interface. If the layout of a web page changes, our code may stop working even though the page is still available. For that reason, scraping code should be kept simple, checked regularly, and used with care.