5 Data Structures

Data structures are containers for data values. They define how objects are organized and stored in R, and different structures can hold different kinds of values.

We’ve met vectors already. A vector is the basic building block: it stores multiple values of the same type in a single object. Once we understand vectors, the other data structures follow naturally.

Factors. We can think about factors as vectors with categorical labels.
Matrices and arrays. A matrix is an extension of a vector to two dimensions. An array is a multidimensional vector.
Lists. Lists are a general form of vector in which the various elements need not be of the same type. Lists can contain other objects, such as vectors, lists and data frames.
Data frames. Data frames are matrix-like structures in which the columns can be of different types.

In this chapter, we’ll walk through each data structure in terms of what it is and how it works.

5.1 Factor

A factor is a specialized tool for handling categorical data. It is useful when values belong to a fixed set of groups, such as “low”, “medium”, and “high”.

Unlike a regular character vector, which treats each value as a standalone string, a factor defines a set of discrete categories, called levels (explained below), and assigns each observation to one of them.

We can create a factor from a regular vector with factor(). When we do this, R scans our data to identify every unique category present.

flavor <- c("chocolate", "vanilla", "strawberry", "mint", "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)
flavor_f

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla    pistachio 
## Levels: chocolate coffee mint pistachio strawberry vanilla
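
To see what "assigning each observation to a level" means in practice, here is a minimal sketch using the same flavor vector. Internally, a factor keeps one integer code per observation, and each code points at a level. The variable name codes is only illustrative.

```r
flavor <- c("chocolate", "vanilla", "strawberry", "mint",
            "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)

# Each observation is stored as an integer code that points at a level
codes <- as.integer(flavor_f)
codes

# Looking the codes up in the levels recovers the original strings
levels(flavor_f)[codes]

# table() counts how many observations fall into each level
table(flavor_f)
```

Here "vanilla" is stored as code 6 because it is the sixth level in alphabetical order.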

levels

The most important attribute of a factor is its levels. Levels are the distinct categories that a factor can take.

attributes(flavor_f)

## $levels
## [1] "chocolate"  "coffee"     "mint"       "pistachio"  "strawberry" "vanilla"   
## 
## $class
## [1] "factor"

levels() returns the categories of a factor.

levels(flavor_f)

## [1] "chocolate"  "coffee"     "mint"       "pistachio"  "strawberry" "vanilla"

nlevels() returns the number of categories of a factor.

nlevels(flavor_f)

## [1] 6

By default, R sorts factor levels alphabetically. This is fine for ice cream flavors, but it’s a headache when we want to visualize data in a specific order (like “Small, Medium, Large”).

We can manually set the order of the levels using the argument levels.

factor(flavor)

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla    pistachio 
## Levels: chocolate coffee mint pistachio strawberry vanilla

factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"))

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla    pistachio 
## Levels: strawberry vanilla chocolate coffee mint pistachio
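
The level order matters as soon as we summarize the data. The sketch below uses a small made-up size vector (not part of the flavor data) to show how the levels argument controls the order in which table() reports counts, and what happens to a value that is missing from levels.

```r
size <- c("Medium", "Small", "Large", "Small", "Medium", "Small")

# Default: levels are sorted alphabetically (Large, Medium, Small)
table(factor(size))

# Explicit levels: counts are reported in the order we asked for
size_f <- factor(size, levels = c("Small", "Medium", "Large"))
table(size_f)

# A value that is not listed in `levels` becomes NA
unlisted <- factor(c("Small", "Huge"), levels = c("Small", "Medium", "Large"))
unlisted
```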

Furthermore, if the categories have a natural rank (i.e., where one is “greater” than another), we can set ordered = TRUE to create an ordinal factor.

For example, suppose we conducted a survey and asked respondents how they felt about the statement “A.I. is going to change the world.” Each respondent gave one of the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.

survey_results <- factor(
  c("Disagree", "Neutral", "Strongly Disagree", "Neutral",
    "Agree", "Strongly Agree", "Disagree", "Strongly Agree",
    "Neutral", "Strongly Disagree", "Neutral", "Agree"),
  levels = c("Strongly Disagree", "Disagree",
             "Neutral", "Agree", "Strongly Agree"),
  ordered = TRUE
)

survey_results

##  [1] Disagree          Neutral           Strongly Disagree Neutral          
##  [5] Agree             Strongly Agree    Disagree          Strongly Agree   
##  [9] Neutral           Strongly Disagree Neutral           Agree            
## Levels: Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree
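
What an ordinal factor buys us is comparison. The following sketch uses a shorter, made-up set of four responses on the same scale; the names agreement and responses are illustrative.

```r
agreement <- c("Strongly Disagree", "Disagree", "Neutral",
               "Agree", "Strongly Agree")
responses <- factor(c("Disagree", "Agree", "Neutral", "Strongly Agree"),
                    levels = agreement, ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
responses > "Neutral"

# How many respondents agreed at least somewhat?
sum(responses >= "Agree")

# min() and max() are defined for ordered factors
max(responses)
```

The same comparisons on an unordered factor would produce NA values with a warning, because R has no ranking to compare against.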

Note: Factors are especially useful in modeling and plotting because R can treat categories differently from ordinary text.

5.2 Matrix

A matrix is what we get when we take a vector and give it two dimensions: rows and columns.

For example, 1:6 is a vector.

a <- 1:6
dim(a) # initially NULL

## NULL

If we assign dimensions, the vector becomes a matrix and R displays it as one.

dim(a) <- c(2, 3)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

In practice, we usually create matrices directly with matrix(), and specify the numbers of rows and columns.

a <- matrix(data = 1:6, nrow = 2, ncol = 3)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
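
Notice that the values fill the matrix column by column. As a small sketch of the alternative, matrix() also accepts a byrow argument; the name by_row is illustrative.

```r
# Default: values fill the matrix column by column
matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row

# dim(), nrow() and ncol() report the shape
dim(by_row)
nrow(by_row)
ncol(by_row)
```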

Note: As with vectors, all elements in a matrix must be of the same basic type. If we mix types, R will coerce them to a common type.
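
A quick sketch of that coercion, using made-up values: typeof() reveals the common type R settled on.

```r
# Mixing numbers, text and logicals: everything is coerced to character
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
mixed
typeof(mixed)

# Mixing only numbers and logicals coerces to numeric (TRUE becomes 1)
num <- matrix(c(1.5, TRUE, 3, FALSE), nrow = 2)
num
typeof(num)
```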

matrix indexing

We can refer to part of a matrix using the indexing operator [].

Second row and second column:

a[2, 2]

## [1] 4

First two rows and first two columns:

a[1:2, 1:2]

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

First row:

a[1,]

## [1] 1 3 5

First column:

a[,1]

## [1] 1 2
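
The single-row and single-column results above are plain vectors, not matrices. The sketch below shows how to keep the matrix shape with drop = FALSE, and how negative indices exclude rows or columns; first_row is an illustrative name.

```r
a <- matrix(1:6, nrow = 2, ncol = 3)

# Selecting one row simplifies the result to a plain vector (no dimensions)
dim(a[1, ])

# drop = FALSE keeps the result as a 1 x 3 matrix
first_row <- a[1, , drop = FALSE]
dim(first_row)

# Negative indices exclude: everything except the second column
a[, -2]
```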

`cbind()`, `rbind()`

cbind() and rbind() combine matrices by binding them column-wise and row-wise, respectively.

m1 <- matrix(1:9, ncol = 3, nrow = 3)
m2 <- matrix(10:12, ncol = 1, nrow = 3)
m3 <- matrix(10:12, ncol = 3, nrow = 1)

cbind() binds by columns.

m1

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

m2

##      [,1]
## [1,]   10
## [2,]   11
## [3,]   12

cbind(m1, m2)

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

rbind() binds by rows.

m1

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

m3

##      [,1] [,2] [,3]
## [1,]   10   11   12

rbind(m1, m3)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]   10   11   12
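
Binding only works when the shapes line up: cbind() needs matching row counts and rbind() needs matching column counts. Here is a minimal sketch with the same three matrices; try() is used only to capture the error so the script keeps running, and result is an illustrative name.

```r
m1 <- matrix(1:9, nrow = 3, ncol = 3)
m2 <- matrix(10:12, nrow = 3, ncol = 1)
m3 <- matrix(10:12, nrow = 1, ncol = 3)

dim(cbind(m1, m2))  # row counts match (3 and 3)
dim(rbind(m1, m3))  # column counts match (3 and 3)

# m1 has 3 rows but m3 has 1, so cbind() refuses; try() captures the error
result <- try(cbind(m1, m3), silent = TRUE)
inherits(result, "try-error")
```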

5.3 Array

A matrix extends a vector to two dimensions. An array extends a vector to more dimensions. So a matrix is really just a special case of an array.

For example, we can take a vector of 12 values and give it three dimensions.

b <- 1:12
dim(b) <- c(2, 3, 2)
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

More commonly, we create arrays using array().

b <- array(1:12, dim = c(2,3,2))
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

We can index arrays by providing one position for each dimension.

b[1, 2, 1]

## [1] 3

b[, , 2]

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

Like matrices, arrays store a single atomic type. The underlying storage mechanism for an array (including a matrix) is a vector.

a <- matrix(data = 1:6, nrow = 2, ncol = 3)
mode(a)

## [1] "numeric"

class(a)

## [1] "matrix" "array"

b <- array(1:12, dim = c(2,3,2))
mode(b)

## [1] "numeric"

class(b)

## [1] "array"
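
To make the "underlying vector" idea concrete, the sketch below strips the dimensions off an array and shows that a single index walks the values in storage order.

```r
b <- array(1:12, dim = c(2, 3, 2))

# Stripping the dimensions gives back the underlying vector
as.vector(b)

# A single index walks that vector in storage order,
# so b[1, 2, 1] and b[3] refer to the same element
b[1, 2, 1]
b[3]

# A matrix passes the is.array() check: it is a two-dimensional array
is.array(matrix(1:6, nrow = 2))
```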

5.4 List

So far, every data structure we have seen ultimately behaves like a single-type vector. A list is different.

A list is a vector where each element can be of a different data type. A list can contain numbers, character vectors, matrices, data frames, or even other lists. This makes lists extremely flexible.

To generate a list, we use list(). We can also name each component in a list.

book <- list(title = "Nineteen Eighty-Four: A Novel", 
             author = "George Orwell", 
             published_year = 1949, 
             pages = 328)
book

## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"
## 
## $published_year
## [1] 1949
## 
## $pages
## [1] 328
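
A few base functions help us inspect a list. This short sketch rebuilds the same book list and checks its names, its length, and the type of each component.

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             author = "George Orwell",
             published_year = 1949,
             pages = 328)

names(book)          # component names
length(book)         # number of components
sapply(book, class)  # type of each component
str(book)            # compact overview of the whole structure
```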

list indexing

Lists can be indexed by position or name.

By position:

book[3]

## $published_year
## [1] 1949

book[-3]

## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"
## 
## $pages
## [1] 328

book[[3]]

## [1] 1949

book[c(2, 3)]

## $author
## [1] "George Orwell"
## 
## $published_year
## [1] 1949

By name using $ or [[""]]:

book$title

## [1] "Nineteen Eighty-Four: A Novel"

book[["title"]]

## [1] "Nineteen Eighty-Four: A Novel"

book[c("title", "author")]

## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"

Note: The distinction between [] and [[]] is very important. [] returns a sublist, while [[]] or $ extracts the element itself.

book[3]

## $published_year
## [1] 1949

book[[3]]

## [1] 1949

The first result is still a list; the second is the stored value.
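
The difference shows up as soon as we try to compute with the result. In this sketch (a shortened version of the book list), arithmetic works on the extracted value but fails on the sublist; try() only captures the error, and outcome is an illustrative name.

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             published_year = 1949)

class(book["published_year"])    # a list holding one component
class(book[["published_year"]])  # the number itself

# Arithmetic works on the extracted value...
book[["published_year"]] + 1

# ...but not on the sublist
outcome <- try(book["published_year"] + 1, silent = TRUE)
inherits(outcome, "try-error")
```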

A list can contain other lists.

The fact that a list can contain a list makes it a recursive object in R. Functions can also be recursive, which we’ll discuss later.

books <- list("this list references another list", book)
books

## [[1]]
## [1] "this list references another list"
## 
## [[2]]
## [[2]]$title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## [[2]]$author
## [1] "George Orwell"
## 
## [[2]]$published_year
## [1] 1949
## 
## [[2]]$pages
## [1] 328

To access nested elements, we can stack up the square brackets.

books[[2]][["pages"]]

## [1] 328
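
There is more than one way to reach a nested element. The sketch below uses a shortened book list (title and pages only, so pages sits in position 2) and shows three equivalent forms.

```r
book <- list(title = "Nineteen Eighty-Four: A Novel", pages = 328)
books <- list("this list references another list", book)

# Stacked double brackets
books[[2]][["pages"]]

# Double brackets followed by $
books[[2]]$pages

# A vector inside [[ ]] indexes recursively:
# element 2 of the outer list, then element 2 of the inner list
books[[c(2, 2)]]
```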

Lists provide a convenient way to return the results of a statistical computation.
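
As one illustration (our choice of example, not the only one), lm() fits a linear model and returns its results as a list with a class attached. The sketch uses the cars dataset that ships with R.

```r
# Fit a simple linear model on the built-in `cars` dataset
fit <- lm(dist ~ speed, data = cars)

# The fitted model is a list underneath
is.list(fit)

# Each piece of the result is a named component
names(fit)

# Components are extracted like any other list element
fit$coefficients
```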

5.5 Data frame

A data frame can be understood as a special kind of list: it is a named list of equal-length vectors displayed as a table.

It has rows and columns. Each column can hold a different type of data, but all columns must have the same length. The columns must have names. The components of a data frame can be vectors, factors, lists, or other data frames.

That means a data frame combines two ideas:

like a list, its columns can have different types;
like a matrix, it has rows and columns.

Data frames are particularly good for representing observational data.

To create a data frame, use data.frame().

laureate <- c("Bob Dylan", "Mo Yan", "Ernest Hemingway", "Winston Churchill", "Bertrand Russell")
year <- c(2016, 2012, 1954, 1953, 1950)
country <- c("United States", "China", "United States", "United Kingdom", "United Kingdom")
genre <- c("poetry, songwriting", "novel, short story", "novel, short story, screenplay", "history, essay, memoirs", "philosophy")

nobel_prize_literature <- data.frame(laureate, year, country, genre)
nobel_prize_literature

##            laureate year        country                          genre
## 1         Bob Dylan 2016  United States            poetry, songwriting
## 2            Mo Yan 2012          China             novel, short story
## 3  Ernest Hemingway 1954  United States novel, short story, screenplay
## 4 Winston Churchill 1953 United Kingdom        history, essay, memoirs
## 5  Bertrand Russell 1950 United Kingdom                     philosophy
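
Before working with a data frame, it helps to check its shape and column types. The sketch below uses a three-row excerpt of the same data under the shorter, illustrative name nobel. We pass stringsAsFactors = FALSE explicitly so the text columns stay character regardless of R version; it has been the default since R 4.0.

```r
nobel <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway"),
  year = c(2016, 2012, 1954),
  country = c("United States", "China", "United States"),
  stringsAsFactors = FALSE
)

dim(nobel)            # rows, then columns
names(nobel)          # column names
sapply(nobel, class)  # type of each column
str(nobel)            # compact overview
```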

Note: Although a data frame looks like a matrix, it is not a matrix. It is a list interpreted as a data frame.

mode(nobel_prize_literature)

## [1] "list"

class(nobel_prize_literature)

## [1] "data.frame"

A data frame is a list with class “data.frame”.
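
We can verify the list nature directly. In this sketch (a two-column excerpt named nobel for illustration), the length of the data frame is its number of columns, and columns of unequal length are rejected; try() only captures the error.

```r
nobel <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway"),
  year = c(2016, 2012, 1954)
)

is.list(nobel)    # TRUE: a data frame is a list of columns
is.matrix(nobel)  # FALSE: it only looks like a matrix
length(nobel)     # the list has one component per column

# Columns must have equal length: 3 values versus 2 is an error
bad <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(bad, "try-error")
```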

data frame indexing

Because a data frame is both list-like and matrix-like, we can index it in two ways.

By column name, using list-style extraction $ or [[]].

nobel_prize_literature$laureate

## [1] "Bob Dylan"         "Mo Yan"            "Ernest Hemingway"  "Winston Churchill"
## [5] "Bertrand Russell"

nobel_prize_literature[["laureate"]]

## [1] "Bob Dylan"         "Mo Yan"            "Ernest Hemingway"  "Winston Churchill"
## [5] "Bertrand Russell"

By row and column position, using matrix-style indexing.

nobel_prize_literature[1,]

##    laureate year       country               genre
## 1 Bob Dylan 2016 United States poetry, songwriting
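
Matrix-style indexing also accepts column names and ranges. The following sketch uses a three-row excerpt (named nobel for illustration) to show a few common forms.

```r
nobel <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway"),
  year = c(2016, 2012, 1954),
  country = c("United States", "China", "United States")
)

# One cell: second row, "year" column
nobel[2, "year"]

# First two rows, two named columns: still a data frame
nobel[1:2, c("laureate", "year")]

# A whole column, matrix-style: simplified to a vector
nobel[, "year"]

# A whole column, list-style with single brackets: a one-column data frame
nobel["year"]
```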

Logical conditions are allowed and, in practice, frequently used.

nobel_prize_literature$laureate[nobel_prize_literature$country == "United Kingdom"]

## [1] "Winston Churchill" "Bertrand Russell"

nobel_prize_literature$country == "United Kingdom"

## [1] FALSE FALSE FALSE  TRUE  TRUE

This works because the condition creates a logical vector that selects matching rows.
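
The same idea selects entire rows when the condition is placed in the row position. Here is a sketch with three columns of the data (named nobel for illustration); uk and later_uk are illustrative names, and subset() is a base R alternative.

```r
nobel <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway",
               "Winston Churchill", "Bertrand Russell"),
  year = c(2016, 2012, 1954, 1953, 1950),
  country = c("United States", "China", "United States",
              "United Kingdom", "United Kingdom")
)

# Keep every column for the rows where the condition is TRUE
uk <- nobel[nobel$country == "United Kingdom", ]
uk

# Conditions can be combined with & (and) or | (or)
later_uk <- nobel[nobel$country == "United Kingdom" & nobel$year > 1950,
                  "laureate"]
later_uk

# subset() expresses the same kind of filter without repeating nobel$
subset(nobel, year < 1960)
```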

5.6 Package-specific object classes

When calling functions from an add-on package, we often get back objects whose classes are specific to that package. For instance, quantmod returns xts or zoo objects, and tidyquant returns data in “tidy” forms, such as tbl_df and tbl. Underneath these unfamiliar names, however, the objects are built on the basic R data structures.
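
To avoid depending on any add-on package, the sketch below imitates the idea in base R: we stack a made-up class name, my_pkg_table (purely hypothetical, not from any real package), on top of a data frame and then inspect what lies underneath. The prices data is also invented for illustration.

```r
prices <- data.frame(day = 1:3, close = c(10.5, 10.9, 10.7))

# Imitate a package by stacking a hypothetical class on top of "data.frame"
class(prices) <- c("my_pkg_table", "data.frame")
class(prices)

# The object still passes the data frame and list checks
inherits(prices, "data.frame")
is.list(prices)

# typeof() reports the underlying storage, ignoring the class labels
typeof(prices)

# Ordinary data frame indexing keeps working
prices$close[2]
```

With a real package object, class(), inherits(), typeof(), and str() are reasonable first steps for finding out which familiar structure it is built on.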