10 Inputs and Outputs
Data comes from various sources and in various formats. It can be structured or unstructured; it can be numbers, text, networks, audio, visuals, and more.
In this chapter, we discuss how to read and write files in R. We’ll primarily be working with structured data. We’ll discuss unstructured textual data in the case study in the final module.
In R, we will be primarily working with data frames. A data frame is the recommended R data structure for storing and representing tabular data. If the data is not already a data frame, we will turn it into a format that R is able to process.
In the sections below, we will focus on three topics:
- Internal datasets and standard flat files (e.g., CSV/TSV)
- Specialized formats such as JSON, and proprietary software, including Excel and stats software (e.g., SAS/SPSS)
- Workflow optimization, including reading files in batches
10.1 Datasets available in installed packages
First of all, before we load any external datasets, note that R ships with a variety of toy datasets in the package datasets. data() lists all the datasets in datasets.
Additionally, many add-on packages contain datasets that accompany their functions, so users can test the code and explore what the functions do. We can access those datasets after loading the package. data(package = .packages(all.available = TRUE)) lists all datasets in all available packages.
To load a dataset from a package, we can use the function data(). For example, data("starwars", package = "dplyr") loads the starwars dataset in the package dplyr.
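The printed result below can be reproduced with something like the following (using head() to limit the display to three rows is an assumption; any printing method works):

```r
# Load the starwars dataset from dplyr into the workspace
data("starwars", package = "dplyr")

# Print the first three rows
head(starwars, 3)
```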
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid
## # ℹ 3 more variables: films <list>, vehicles <list>, starships <list>
We can also use the pattern pkg::name to load an object from a package.
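For example, dplyr::starwars refers to the dataset without attaching the whole package; a call like the following would print the same rows as above:

```r
# Access the dataset directly via pkg::name, without library(dplyr)
head(dplyr::starwars, 3)
```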
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid
## # ℹ 3 more variables: films <list>, vehicles <list>, starships <list>
Quick notes on tidyverse's tibble
dplyr functions return a tidyverse tibble rather than a standard data frame. Tibbles behave much like data frames, but they are designed for tidyverse workflows and offer more readable printing, especially with large datasets. This does not fundamentally change the way we work with "data frames".
10.2 Reading tabular text files
In terms of external datasets, the format we may be most familiar with is probably the spreadsheet-like text file. These files hold structured data in a tabular, or rectangular, shape.
In these files, each line is a record/case/observation. A record contains variables (columns/fields/dimensions/features) associated with that record; the variables are often separated by a one-character delimiter, such as a space, tab, colon, or comma. Each record contains the same number of columns.
Tabular data is commonly stored in comma-separated values (CSV) and tab-separated values (TSV) files. The CSV format in particular is popular for storing data, and many programs can import and export data in that format.
In base R, the function read.table() and its family members read.csv() and read.delim() can read tabular text files. read.csv() and read.delim() (and a few others) are convenience functions based on read.table() to read comma-separated and tab-separated files.
readr provides read.table() equivalents that are faster than base R functions. As with the read.table() family, read_delim() reads a delimited file, and read_csv() and read_tsv() are special cases of the general read_delim() function.
Both groups of functions read data from a file (by default, relative to our working directory) and create a data frame. Both can also read CSV files from a URL: we simply supply the URL in place of a file name.
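The base R way, as a sketch (the file name and URL are hypothetical):

```r
# Read a local CSV file from the working directory (hypothetical name)
dat <- read.csv("data.csv")

# The same call accepts a URL in place of a file name (hypothetical URL)
dat_url <- read.csv("https://example.com/data.csv")
```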
The readr way:
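A sketch with readr (again, the file name and URL are hypothetical); read_csv() returns a tibble:

```r
library(readr)

# Read a local CSV file (hypothetical name)
dat <- read_csv("data.csv")

# Or read from a URL (hypothetical URL)
dat_url <- read_csv("https://example.com/data.csv")
```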
Header, delimiter, skipping lines, encoding
When reading files into R, there are a few elements of a file to keep an eye on. Does the first line contain the variable names? What is the correct field delimiter? Do the first few lines (the "junk" rows) need to be skipped, and if so, how many? We also need to make sure that character fields are read in the correct encoding (e.g., when we work with Chinese characters), unless the file to be imported is entirely in ASCII.
Let's see a sample file. In this dataset, the first three lines have no data, the fourth line contains the variable names, and the fifth line is empty.
## [1] " "
## [2] " "
## [3] " "
## [4] "ticker evtdate car"
## [5] ""
## [6] " EGAS 12FEB2014 0.02356"
## [7] " EGAS 27JUL2012 -0.00276"
## [8] " HGR 03JUL2002 0.01478"
## [9] " HGR 03MAY2005 -0.05070"
## [10] " HGR 05AUG2002 0.02337"
If we use base R's read.table() to import the file, the first few empty lines are skipped automatically. By default, header is FALSE and the delimiter is any whitespace.
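Assuming the file is named output.txt (as in the read_table() example later in this section), the default call might look like:

```r
# header = FALSE and whitespace delimiter by default;
# empty lines are skipped, so the variable-name line becomes the first data row
output1 <- read.table("output.txt")
head(output1)
```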
## V1 V2 V3
## 1 ticker evtdate car
## 2 EGAS 12FEB2014 0.02356
## 3 EGAS 27JUL2012 -0.00276
## 4 HGR 03JUL2002 0.01478
## 5 HGR 03MAY2005 -0.05070
## 6 HGR 05AUG2002 0.02337
We can set header = TRUE and manually specify the number of lines to skip.
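A sketch (assuming the same hypothetical output.txt; skipping the three junk lines leaves the variable-name line as the first line read, which then serves as the header):

```r
# Skip the three junk lines, then treat the next non-empty line as the header
output2 <- read.table("output.txt", header = TRUE, skip = 3)
head(output2)
```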
## ticker evtdate car
## 1 EGAS 12FEB2014 0.02356
## 2 EGAS 27JUL2012 -0.00276
## 3 HGR 03JUL2002 0.01478
## 4 HGR 03MAY2005 -0.05070
## 5 HGR 05AUG2002 0.02337
## 6 HGR 20JAN2004 0.04206
We can also use readr's read_table() to read this whitespace-separated file.
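A sketch of the readr call (same hypothetical output.txt; empty rows are skipped by default):

```r
library(readr)

# Skip the three junk lines; the variable-name line is read as column names
output_readr <- read_table("output.txt", skip = 3)
head(output_readr)
```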
## # A tibble: 6 × 3
## ticker evtdate car
## <chr> <chr> <dbl>
## 1 EGAS 12FEB2014 0.0236
## 2 EGAS 27JUL2012 -0.00276
## 3 HGR 03JUL2002 0.0148
## 4 HGR 03MAY2005 -0.0507
## 5 HGR 05AUG2002 0.0234
## 6 HGR 20JAN2004 0.0421
Sometimes, the file doesn’t have a header row at all. We can pass a character vector directly to col_names and provide our own clean names.
output3 <- read_table("output.txt",
                      skip = 5,
                      col_names = c("Ticker", "Date", "Price"))
head(output3)
## # A tibble: 6 × 3
## Ticker Date Price
## <chr> <chr> <dbl>
## 1 EGAS 12FEB2014 0.0236
## 2 EGAS 27JUL2012 -0.00276
## 3 HGR 03JUL2002 0.0148
## 4 HGR 03MAY2005 -0.0507
## 5 HGR 05AUG2002 0.0234
## 6 HGR 20JAN2004 0.0421
10.3 Writing to CSV files
If we want to save a data frame to a CSV file, one option is to use write.csv(), or write_csv() from readr.
write_csv(x, "filename.csv")
The function takes a data frame and a path to the file to write to; relative paths are resolved against our working directory. By default, the column names are written as the first line of the file (col_names = TRUE). Note that base R's write.csv() additionally writes row names by default; pass row.names = FALSE to suppress them.
10.4 Reading other text documents for storing data
In addition to delimiter-separated values files, there are many more formats and standards of text documents for storing data, including JavaScript Object Notation (JSON) and eXtensible Markup Language (XML).
If you collect data from the Internet, the data is most likely stored in JSON format. JSON is the de facto standard for web applications to exchange data. However, JSON is not native to R, so we need to translate it into a data structure friendly to R, such as a list or a data frame.
In R, rjson and jsonlite, among other packages, read and write JSON data. jsonlite has additional cleaning functionalities (e.g., flattening nested data frames) and can read ndjson data.
If we are lucky, we can use jsonlite::fromJSON() to read JSON objects directly into R. But quite often, we need to take extra steps before R can readily read JSON objects. For instance, if we download the Yelp dataset from Kaggle, we need a middleman to help read those datasets into R data structures. See the discussion below.
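A minimal sketch of the lucky case (fromJSON() accepts a file path, a URL, or, as here, a JSON string; the data values are made up):

```r
library(jsonlite)

# A JSON array of objects is simplified to a data frame by default
df <- fromJSON('[{"ticker": "EGAS", "car": 0.02356},
                 {"ticker": "HGR",  "car": 0.01478}]')
```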
10.5 Reading unstructured texts
To read unstructured texts that are not organized into tabular forms, we can use readLines() from base R or read_lines() from the package readr. These functions read individual lines and then process them as strings to extract data items. The functions will return a character vector, with one value corresponding to each row in the file.
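A sketch of both functions (notes.txt is a hypothetical plain-text file):

```r
# Each element of the returned character vector is one line of the file
lines_base  <- readLines("notes.txt")
lines_readr <- readr::read_lines("notes.txt")
```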
Back to the Yelp dataset:
library(jsonlite)
con <- read_lines("yelp_academic_dataset_business.json")
df <- stream_in(textConnection(con))
stream_in() implements line-by-line processing of JSON data in the ndjson format over a connection (e.g., url, file).
Reading unstructured data as corpora for text analysis is another story. In these cases, we can use packages that are developed specifically for text mining or natural language processing.
10.6 Reading Excel files
Packages readxl and xlsx allow us to deal with Excel files of different versions conveniently. Note that xlsx requires Java.
library(readxl)
dataset <- read_excel("dataset.xlsx")  # hypothetical file name
The Excel file format is proprietary. When you have a choice, avoid saving your files in proprietary formats, so that your data is not locked into a platform you do not control.
10.7 Reading Stata, SPSS, SAS files
Package haven provides methods to read data from the major statistical software Stata, SPSS and SAS.
SPSS
library(haven)
dataset <- read_sav("dataset.sav")  # hypothetical file name
Stata
library(haven)
dataset <- read_stata("dataset.dta")  # hypothetical file name
SAS
library(haven)
dataset <- read_sas("dataset.sas7bdat")  # hypothetical file name
Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, and dBase. Note that foreign does not support Stata after version 12.
10.8 Reading data from databases
Many R packages have been developed to work with relational and non-relational databases. Look up the package that can connect to your database on the CRAN Task View page Databases with R.
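As one common route (an assumption on our part, not a recommendation from the Task View), the DBI package provides a uniform interface to many database backends; the sketch below uses an in-memory SQLite database via RSQLite:

```r
library(DBI)

# Connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a data frame to a table, then query it back as a data frame
dbWriteTable(con, "mtcars", mtcars)
df <- dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```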
10.9 Reading many files
Sometimes, we have hundreds of data files in our working directory to read into our workspace. They are stored in the same file format and record the same set of attributes of the topic we are probing.
As a first step, we can use list.files() to list all the files in the working directory, and observe if there is a pattern pertinent to all files. For instance, they have the same suffix, or their file names contain the same string.
Next, we can read them in a batch and append them vertically (row-bind them) into a single data frame.
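A base R sketch of the batch workflow (assuming the files are CSVs whose names end in .csv and that they share the same columns):

```r
# List matching files in the working directory
files <- list.files(pattern = "\\.csv$")

# Read each file into a data frame, then stack them vertically
dfs <- lapply(files, read.csv)
combined <- do.call(rbind, dfs)
```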