10 Inputs and Outputs
Data comes from various sources and in various formats. It can be structured or unstructured; it can be numbers, text, networks, audio, visuals, and more.
In this chapter, we discuss how to read and write files in R. We’ll primarily be working with structured data. We’ll discuss unstructured textual data in the case study in the final module.
In R, we will be primarily working with data frames. A data frame is the recommended R data structure for storing and representing tabular data. If the data is not already a data frame, we will turn it into a format that R is able to process.
In the sections below, we will focus on three topics:
- Internal datasets and standard flat files (e.g., CSV/TSV)
- Specialized formats such as JSON, and proprietary software, including Excel and stats software (e.g., SAS/SPSS)
- Workflow optimization, including reading files in batches
10.1 Datasets available in installed packages
First of all, before we load any external datasets, note that R ships with a variety of toy datasets in the package datasets. data() lists all the datasets in datasets.
Additionally, many add-on packages contain datasets that accompany their functions, so users can test the code and explore what the functions do. We can access those datasets after loading the package. data(package = .packages(all.available = TRUE)) lists all datasets in all available packages.
To load a dataset from a package, we can use the function data(). For example, data("starwars", package = "dplyr") loads the starwars dataset in the package dplyr.
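The printed result below can be reproduced with something like the following (using head() to limit the display to three rows is an assumption; any printing method works):

```r
# Load the starwars dataset from dplyr into the workspace
data("starwars", package = "dplyr")

# Print the first three rows
head(starwars, 3)
```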
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid
## # ℹ 3 more variables: films <list>, vehicles <list>, starships <list>
We can also use the pattern pkg::name to load an object from a package.
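For example, dplyr::starwars refers to the dataset without attaching the whole package; a call like the following would print the same rows as above:

```r
# Access the dataset directly via pkg::name, without library(dplyr)
head(dplyr::starwars, 3)
```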
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid
## # ℹ 3 more variables: films <list>, vehicles <list>, starships <list>
Quick notes on tidyverse's tibble
dplyr functions return a tidyverse tibble rather than a standard data frame. Tibbles behave much like data frames, but they are designed for tidyverse workflows and offer more readable printing, especially with large datasets. This does not fundamentally change the way we work with "data frames".
10.2 Reading tabular text files
In terms of external datasets, the format we may be most familiar with is probably the spreadsheet-like text file. These files hold structured data in a tabular, or rectangular, shape.
In these files, each line is a record/case/observation. A record contains variables (columns/fields/dimensions/features) associated with that record; the variables are often separated by a one-character delimiter, such as a space, tab, colon, or comma. Each record contains the same number of columns.
Tabular data is commonly stored in comma-separated values (CSV) and tab-separated values (TSV) files. The CSV format in particular is popular for storing data, and many programs can import and export data in that format.
In base R, the function read.table() and its family members read.csv() and read.delim() can read tabular text files. read.csv() and read.delim() (and a few others) are convenience functions based on read.table() to read comma-separated and tab-separated files.
readr provides read.table() equivalents that are faster than base R functions. As with the read.table() family, read_delim() reads a delimited file, and read_csv() and read_tsv() are special cases of the general read_delim() function.
Both groups of functions read data from a file (by default, relative to our working directory) and create a data frame. Both can also read CSV files from a URL: we simply supply the URL in place of a file name.
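The base R way, as a sketch (the file name and URL are hypothetical):

```r
# Read a local CSV file from the working directory (hypothetical name)
dat <- read.csv("data.csv")

# The same call accepts a URL in place of a file name (hypothetical URL)
dat_url <- read.csv("https://example.com/data.csv")
```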
The readr way:
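A sketch with readr (again, the file name and URL are hypothetical); read_csv() returns a tibble:

```r
library(readr)

# Read a local CSV file (hypothetical name)
dat <- read_csv("data.csv")

# Or read from a URL (hypothetical URL)
dat_url <- read_csv("https://example.com/data.csv")
```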
Header, delimiter, skipping lines, encoding
When reading files into R, there are a few elements of a file to keep an eye on. Does the first line contain the variable names? What is the correct field delimiter? Do the first few lines (the "junk" rows) need to be skipped, and if so, how many? We also need to make sure that character fields are read in the correct encoding (e.g., when we work with Chinese characters), unless the file to be imported is entirely in ASCII.
Let's see a sample file. In this dataset, the first three lines have no data, the fourth line contains the variable names, and the fifth line is empty.
## [1] " "
## [2] " "
## [3] " "
## [4] "ticker evtdate car"
## [5] ""
## [6] " EGAS 12FEB2014 0.02356"
## [7] " EGAS 27JUL2012 -0.00276"
## [8] " HGR 03JUL2002 0.01478"
## [9] " HGR 03MAY2005 -0.05070"
## [10] " HGR 05AUG2002 0.02337"
If we use base R's read.table() to import the file, the first few empty lines are skipped automatically. By default, header is FALSE and the delimiter is any whitespace.
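Assuming the file is named output.txt (as in the read_table() example later in this section), the default call might look like:

```r
# header = FALSE and whitespace delimiter by default;
# empty lines are skipped, so the variable-name line becomes the first data row
output1 <- read.table("output.txt")
head(output1)
```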
## V1 V2 V3
## 1 ticker evtdate car
## 2 EGAS 12FEB2014 0.02356
## 3 EGAS 27JUL2012 -0.00276
## 4 HGR 03JUL2002 0.01478
## 5 HGR 03MAY2005 -0.05070
## 6 HGR 05AUG2002 0.02337
We can set header = TRUE and manually specify the number of lines to skip.
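A sketch (assuming the same hypothetical output.txt; skipping the three junk lines leaves the variable-name line as the first line read, which then serves as the header):

```r
# Skip the three junk lines, then treat the next non-empty line as the header
output2 <- read.table("output.txt", header = TRUE, skip = 3)
head(output2)
```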
## ticker evtdate car
## 1 EGAS 12FEB2014 0.02356
## 2 EGAS 27JUL2012 -0.00276
## 3 HGR 03JUL2002 0.01478
## 4 HGR 03MAY2005 -0.05070
## 5 HGR 05AUG2002 0.02337
## 6 HGR 20JAN2004 0.04206
We can also use readr's read_table() to read this whitespace-separated file.
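A sketch of the readr call (same hypothetical output.txt; empty rows are skipped by default):

```r
library(readr)

# Skip the three junk lines; the variable-name line is read as column names
output_readr <- read_table("output.txt", skip = 3)
head(output_readr)
```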
## # A tibble: 6 × 3
## ticker evtdate car
## <chr> <chr> <dbl>
## 1 EGAS 12FEB2014 0.0236
## 2 EGAS 27JUL2012 -0.00276
## 3 HGR 03JUL2002 0.0148
## 4 HGR 03MAY2005 -0.0507
## 5 HGR 05AUG2002 0.0234
## 6 HGR 20JAN2004 0.0421
Sometimes, the file doesn’t have a header row at all. We can pass a character vector directly to col_names and provide our own clean names.
output3 <- read_table("output.txt",
                      skip = 5,
                      col_names = c("Ticker", "Date", "Price"))
head(output3)
## # A tibble: 6 × 3
## Ticker Date Price
## <chr> <chr> <dbl>
## 1 EGAS 12FEB2014 0.0236
## 2 EGAS 27JUL2012 -0.00276
## 3 HGR 03JUL2002 0.0148
## 4 HGR 03MAY2005 -0.0507
## 5 HGR 05AUG2002 0.0234
## 6 HGR 20JAN2004 0.0421
10.3 Writing to CSV files
If we want to save a data frame to a CSV file, one option is to use write.csv(), or write_csv() from readr.
write_csv(x, "filename.csv")
The function takes a data frame and a path to the file to write to; relative paths are resolved against our working directory. By default, the column names are written as the first line of the file (col_names = TRUE). Note that base R's write.csv() additionally writes row names by default; pass row.names = FALSE to suppress them.
10.4 Reading other text documents for storing data
In addition to delimiter-separated values files, there are many more formats and standards of text documents for storing data, including JavaScript Object Notation (JSON) and eXtensible Markup Language (XML).
If you collect data from the Internet, the data is most likely stored in JSON format. JSON is the de facto standard for web applications to exchange data. However, JSON is not native to R, so we need to translate it into a data structure friendly to R, such as a list or a data frame.
In R, rjson and jsonlite, among other packages, read and write JSON data. jsonlite has additional cleaning functionalities (e.g., flattening nested data frames) and can read ndjson data.
If we are lucky, we can use jsonlite::fromJSON() to read JSON objects directly into R. But quite often, we need to take extra steps before R can readily read JSON objects. For instance, if we download the Yelp dataset from Kaggle, we need a middleman to help read those datasets into R data structures. See the discussion below.
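A minimal sketch of the lucky case (fromJSON() accepts a file path, a URL, or, as here, a JSON string; the data values are made up):

```r
library(jsonlite)

# A JSON array of objects is simplified to a data frame by default
df <- fromJSON('[{"ticker": "EGAS", "car": 0.02356},
                 {"ticker": "HGR",  "car": 0.01478}]')
```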
10.5 Reading unstructured texts
To read unstructured texts that are not organized into tabular forms, we can use readLines() from base R or read_lines() from the package readr. These functions read individual lines and then process them as strings to extract data items. The functions will return a character vector, with one value corresponding to each row in the file.
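A sketch of both functions (notes.txt is a hypothetical plain-text file):

```r
# Each element of the returned character vector is one line of the file
lines_base  <- readLines("notes.txt")
lines_readr <- readr::read_lines("notes.txt")
```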
Back to the Yelp dataset:
library(jsonlite)
con <- read_lines("yelp_academic_dataset_business.json")
df <- stream_in(textConnection(con))
stream_in() implements line-by-line processing of JSON data in the ndjson format over a connection (e.g., url, file).
Reading unstructured data as corpora for text analysis is another story. In these cases, we can use packages that are developed specifically for text mining or natural language processing.
10.6 Reading Excel files
Packages readxl and xlsx allow us to deal with Excel files of different versions conveniently. Note that xlsx requires Java.
library(readxl)
dataset <- read_excel("dataset.xlsx")  # hypothetical file name
The Excel file format is proprietary. When you have a choice, avoid saving your files in proprietary formats, so that your data is not locked into a platform you do not control.
10.7 Reading Stata, SPSS, SAS files
Package haven provides methods to read data from the major statistical software Stata, SPSS and SAS.
SPSS
library(haven)
dataset <- read_sav("dataset.sav")  # hypothetical file name
Stata
library(haven)
dataset <- read_stata("dataset.dta")  # hypothetical file name
SAS
library(haven)
dataset <- read_sas("dataset.sas7bdat")  # hypothetical file name
Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, and dBase. Note that foreign does not support Stata after version 12.
10.8 Reading data from databases
Many R packages have been developed to work with relational and non-relational databases. Look up the package that can connect to your database on the CRAN Task View page Databases with R.
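As one common route (an assumption on our part, not a recommendation from the Task View), the DBI package provides a uniform interface to many database backends; the sketch below uses an in-memory SQLite database via RSQLite:

```r
library(DBI)

# Connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a data frame to a table, then query it back as a data frame
dbWriteTable(con, "mtcars", mtcars)
df <- dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

dbDisconnect(con)
```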
10.9 Reading many files
Sometimes, we have hundreds of data files in our working directory to read into our workspace. They are stored in the same file format and record the same set of attributes of the topic we are probing.
As a first step, we can use list.files() to list all the files in the working directory, and observe if there is a pattern pertinent to all files. For instance, they have the same suffix, or their file names contain the same string.
Next, we can read them in a batch and append them vertically (row-bind them) into a single data frame.
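A base R sketch of the batch workflow (assuming the files are CSVs whose names end in .csv and that they share the same columns):

```r
# List matching files in the working directory
files <- list.files(pattern = "\\.csv$")

# Read each file into a data frame, then stack them vertically
dfs <- lapply(files, read.csv)
combined <- do.call(rbind, dfs)
```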