3 Speaking R
This chapter introduces the R environment.
3.1 Working directory
When using R, a good first step is to set a working directory. The working directory is the default location for the task at hand: reading and writing data files, opening and saving scripts, and saving the workspace image. It is the folder we return to for the problem we are working on.
The working directory is set with setwd(), which takes the folder path as a quoted string. If we are in RStudio, we can also use the drop-down menu Session -> Set Working Directory -> Choose Directory… to conveniently change the working directory.
The current working directory can be checked with getwd().
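A minimal sketch of these two functions follows. It uses R's per-session temporary folder as a stand-in for a real project directory; that choice is an assumption made only so the example runs anywhere.

```r
# Remember where we started so we can switch back afterwards
old_dir <- getwd()

# tempdir() is used here only as a stand-in for a real project folder
setwd(tempdir())
getwd()          # now reports the temporary folder

# Return to the original working directory
setwd(old_dir)
```

In real work, the argument to setwd() would be the path to the project folder.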
It is recommended that we use separate working directories for different projects to keep files organized and make each project easier to share. Working this way ensures each project is self-contained: everything it needs (data files, scripts, and outputs such as plots or cleaned data) lives inside its own folder rather than being scattered elsewhere.
3.2 R command
If we now move to the console, R displays a prompt >, indicating that it is waiting for our commands.
using R interactively
To get started, let’s first treat R as a calculator. When we enter an expression at the command prompt, R will evaluate the expression, print the result, or respond with an error message.
## [1] 2
## [1] 3
## [1] 2
## [1] 1
## [1] 1.000000e+00 7.071068e-01 6.123234e-17 -1.000000e+00
## Error in `log()`:
## ! non-numeric argument to mathematical function
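The inputs that produced the output above are not shown. As an illustrative sketch (these particular expressions are assumptions, not necessarily the ones originally used), expressions of the following kind give results like these, including an error when a function receives the wrong type of input:

```r
1 + 1                       # addition
7 %/% 2                     # integer division
7 %% 5                      # remainder (modulo)
sqrt(1)                     # square root
cos(c(0, pi/4, pi/2, pi))   # functions work on several numbers at once

# log() needs a number; try() lets a script continue after the error
res <- try(log("a"), silent = TRUE)
cat(res)                    # prints the captured error message
```

At the interactive prompt, try() is not needed: R simply reports the error and returns to the prompt.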
Unlike in many other programming languages, we can display a value in R without calling print() explicitly.
## [1] 2
This produces the same output as:
## [1] 2
When an expression is entered at the prompt, R evaluates it and automatically calls print() on the result.
The print() function provides a generic way to display R objects.
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
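The code that generated the output above is not shown. A sketch of what such input might look like (the specific expressions are assumptions) is:

```r
1 + 1            # typed at the prompt: the value is printed automatically
print(1 + 1)     # explicit call: same output

# print() works on other kinds of objects too, such as a matrix
m <- matrix(1:12, nrow = 3)   # values fill the matrix column by column
print(m)
```

Automatic printing applies to expressions entered at the top level. Inside a loop or a function, values are not displayed unless print() is called.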
built-in constants
R has a small number of built-in constants: pi, LETTERS, letters, month.abb, and month.name.
pi is the ratio of the circumference of a circle to its diameter.
## [1] 3.141593
LETTERS is the 26 upper-case letters of the Roman alphabet.
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W"
## [24] "X" "Y" "Z"
letters is the 26 lower-case letters of the Roman alphabet.
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
## [24] "x" "y" "z"
month.abb is the three-letter abbreviations for the English month names.
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name is the English names for the months of the year.
## [1] "January" "February" "March" "April" "May" "June" "July"
## [8] "August" "September" "October" "November" "December"
What is “[1]” that accompanies each returned value?
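It means that the index of the first item displayed in the row is 1. Likewise, the [24] in the LETTERS output above marks a row that starts with the 24th item.
In R, any number that we enter in the console is interpreted as a vector. A vector is an ordered collection of values of the same type, such as numbers. We’ll see what a vector is in the next chapter.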
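To see the index markers change, we can print a vector that is too long for one line. A small sketch with arbitrary values follows; where the lines break depends on the console width, so no output is shown.

```r
# A single number is a vector of length 1
x <- 5
length(x)

# A longer vector wraps across lines; each line begins with the
# index of its first element
y <- 101:150
y
y[24]    # square brackets pick out the 24th element
```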
assignment
Like most other languages, R lets us assign values to variables and refer to them by name.
The assignment operator is <-.
An assignment evaluates an expression, and passes the value to a variable. But the result is not automatically printed. The value is stored in the variable that we have defined. That would be a in the example below.
To print the variable value, we simply type the variable name (a in this case).
## [1] 2
Left-to-right assignment also works, but it is unconventional and not recommended.
## [1] 2
A single equal sign = can also be used as an assignment operator (also not recommended). In many other programming languages, = is the usual assignment operator. In R, however, = does not always assign: inside a function call, for example, it is read as naming an argument rather than creating a variable. In general, <- is preferred. There is an in-depth discussion on the differences between the = and <- assignment operators on Stack Overflow.
If the object already exists, its previous value is overwritten.
## [1] 2
## [1] 4
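The inputs behind the output above are not shown. The following sketch, using a as in the text and values chosen to match the printed results, illustrates each form of assignment:

```r
a <- 2      # assign; nothing is printed
a           # typing the name prints the value

2 -> a      # left-to-right assignment (works, but unconventional)
a = 2       # = also assigns at the top level (not recommended)

a <- 4      # assigning again overwrites the previous value
a
```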
variable names
A variable name must start with a period (.) or a letter; if it starts with a . the second character must not be a digit. Names cannot start with a number or an underscore _.
Variable names in R are case-sensitive. For instance, age, Age and AGE are three different variables.
Certain words are reserved and therefore cannot be used as variable names. These include TRUE, FALSE, NULL, NA, if, else, while, function, for, next, break, repeat, and a few others. We can view the list of reserved words in R by typing ?Reserved at the console.
R places no limit on the length of variable names.
Whichever naming convention you choose, make sure that you keep to one convention and use it consistently throughout your code.
R commands are case sensitive
Variables A and a are different.
Functions nrow() and NROW() are different.
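A short sketch of the naming rules and of case sensitivity follows. The variable names are made up for illustration.

```r
# Valid names
my_var  <- 1
.hidden <- 2      # starts with a period followed by a letter
var2    <- 3

# Invalid names, shown as comments because they would not parse:
# 2var  <- 1      # starts with a digit
# _var  <- 1      # starts with an underscore
# .2var <- 1      # period followed by a digit

# Case matters for variables ...
A <- "upper"
a <- "lower"
identical(A, a)

# ... and for functions: nrow() and NROW() treat a plain vector differently
v <- 1:3
nrow(v)    # NULL, because a plain vector has no dimensions
NROW(v)    # 3, because NROW() treats the vector as a one-column matrix
```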
incomplete commands
If a command is not complete at the end of a line, R will give a continuation prompt (+) on the following lines and keep reading input until the expression is finished.
If you want to abandon an incomplete command, press Esc to return to the regular prompt.
auto-completion
R includes automatic completions for object names.
Type the first few characters of a name in the console, then press the Tab key to view the list of possible completions for the object we are trying to refer to.
recalling previous commands; command-line editing
R makes it easy to recall, edit, and rerun previous commands.
We can scroll through our command history using the up and down arrow keys to revisit earlier inputs.
Once a command is on the line, we can use the left and right arrow keys to move within it. We can then tweak the code before reusing it.
source()
After we finish working on a problem, it’s often helpful to keep a record of every step that we have taken.
We can save our commands in an external script file (e.g., project1.R) in the working directory. Later we will be able to use the source() function to read and execute the code without having to retype it.
source("project1.R")
If we are working on a larger project, it’s common to break a long workflow into several smaller scripts: one for data cleaning, one for modeling, one for graphics, and so on. Each script can then be executed individually with source().
source("cleaning.R")
source("models.R")
source("graphics.R")
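To try source() without creating project files, the sketch below writes a tiny script to a temporary file and then runs it. The file name and its contents are hypothetical.

```r
# Create a small script in the session's temporary folder
script_path <- file.path(tempdir(), "demo_script.R")
writeLines(c(
  "# demo_script.R: a made-up example script",
  "total <- sum(1:10)"
), script_path)

# Run every line of the script, as if we had typed it ourselves
source(script_path)
total

# echo = TRUE also prints each command as it runs
source(script_path, echo = TRUE)
```

Objects created by the script, such as total here, remain available in the workspace after source() returns.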
3.3 Comment
A comment starts with a hash mark #.
When R executes code, it ignores anything that follows # on the same line.
This makes comments a useful way to explain what your code is doing, leave notes for yourself or collaborators, or temporarily disable a line of code while testing alternatives.
Comments can appear on their own line or at the end of a line of code:
## [1] 2
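The code that produced the output above is not shown. A minimal sketch of both placements, with an arbitrary calculation:

```r
# A comment on its own line
x <- 1 + 1   # a comment at the end of a line of code
x
```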
We can also use comments to document more complex ideas or to structure our script:
# Load data ---------------------------------------------------------------
data <- read.csv("sales.csv")
# Clean data --------------------------------------------------------------
data$region <- factor(data$region) # convert to factor
data <- na.omit(data) # remove missing rows
# Model -------------------------------------------------------------------
fit <- lm(revenue ~ region + month, data = data)
summary(fit)
Comments are also handy when experimenting: placing a # in front of a line temporarily disables it while we test an alternative.
To create multiline comments, we need to insert a # for each line.
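A sketch of both uses, with made-up values: one line is disabled while an alternative is tried, and a longer note is spread over several comment lines.

```r
values <- c(3, 8, 1, 100)

# Try the median instead of the mean; keep the old line for reference
# centre <- mean(values)
centre <- median(values)
centre

# R has no block-comment syntax, so a longer note
# is written as consecutive lines,
# each starting with its own #.
```

In RStudio, the shortcut Ctrl+Shift+C (Cmd+Shift+C on macOS) comments or uncomments all selected lines at once.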
3.4 Object
The entities that R creates and manipulates are known as objects. Everything in R is an object. These may be numeric vectors, character strings, lists, functions, etc. For now, let’s just think about an object as a “thing” that is represented by the computer.
workspace
The collection of objects currently stored in memory is called the workspace.
The function ls() displays the names of the objects in our workspace.
## [1] "a" "AAPL" "admitted"
## [4] "admitted2" "AFL" "age"
## [7] "applicants" "applicants2" "area"
## [10] "area2" "b" "base_stocks"
## [13] "BC" "book" "books"
## [16] "c" "c0" "c1"
## [19] "c2" "c3" "country"
## [22] "d" "date" "date_string"
## [25] "demo" "deposit" "exclude_trades"
## [28] "f" "file1" "file2"
## [31] "file3" "file4" "file5"
## [34] "file5_2" "file6" "file7"
## [37] "flavor" "flavor_f" "founded"
## [40] "Founded1" "Founded2" "Founded3"
## [43] "fun" "function_letters" "g"
## [46] "gcd" "genre" "get_divisor"
## [49] "get_intersect" "get_score" "i"
## [52] "incl_nt" "laureate" "lcm"
## [55] "len" "m1" "m2"
## [58] "m3" "MMM" "MMM_SMA"
## [61] "MMM_SMA_EMA" "my_function" "myEMA"
## [64] "mylist" "myMom" "mySMA"
## [67] "n" "neu_set" "no"
## [70] "nobel_prize_literature" "now" "nth"
## [73] "num" "output" "output2"
## [76] "output3" "pattern" "pop"
## [79] "pop_den" "pop2" "qm"
## [82] "result" "result1" "result2"
## [85] "sale_cond" "sample" "score_sum"
## [88] "scores" "sent_compound" "sent_tokens"
## [91] "sp500" "sp500stocks" "sp500tickers"
## [94] "square" "starwars" "state.df"
## [97] "state.list" "stocks_br" "string"
## [100] "string1" "string2" "survey_results"
## [103] "t" "test" "time"
## [106] "tq" "tq_to_wide" "tri"
## [109] "tweets" "u" "v"
## [112] "v1" "v2" "v3"
## [115] "v4" "values" "vec"
## [118] "vec1" "vec2" "w"
## [121] "x" "y" "year"
## [124] "year1" "year2" "year3"
## [127] "years" "years_1" "years_2"
## [130] "yes" "z"
To remove objects from our workspace, we use the function rm(). There is no “undo”; once the variable is gone, it’s gone.
We can remove all the objects in memory. This erases our entire workspace at once.
Now, our workspace is empty. ls() returns an empty vector.
## character(0)
We can save all the current objects with save.image(). The workspace will be written to a .RData file in the current working directory. We will be able to reload the workspace from this file when R is started at a later time from the same directory.
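The commands behind this walk-through are not shown, so here is a sketch with made-up object names. Two choices are assumptions made for safety: the image is written to a temporary file rather than to the working directory, and rm(list = ls()) is left as a comment because it would erase everything.

```r
tmp_a <- 1
tmp_b <- "text"
ls()                     # lists tmp_a, tmp_b, and anything else in memory

rm(tmp_b)                # remove one object
exists("tmp_b")          # FALSE: it is gone

# rm(list = ls())        # would remove ALL objects in the workspace

# Save the workspace; with no file argument, save.image() writes ".RData"
image_file <- file.path(tempdir(), "example.RData")
save.image(file = image_file)

# load() restores saved objects; here into a separate environment
restored <- new.env()
load(image_file, envir = restored)
ls(restored)
```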
3.5 Function
Functions are at the heart of how R works. In fact, almost every operation we perform in R is ultimately carried out by a function. When we type an expression at the console, R parses it, rewrites it internally in functional form, and then evaluates that function to produce a result.
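One way to see this is that operators are themselves functions, which can be called by name using backticks. A brief sketch with arbitrary values:

```r
1 + 2          # the usual infix form
`+`(1, 2)      # the same operation written as a function call

x <- c(10, 20, 30)
x[2]           # subsetting ...
`[`(x, 2)      # ... is also a function call
```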
A function is an object that takes some input objects (called arguments) and returns an output object. Most functions follow the familiar structure:
f(argument1, argument2, ...)
We’ve already met several built‑in functions (e.g., print(), mean()). R comes with a large collection of these built‑in functions, covering everything from basic arithmetic to statistical modeling and graphics.
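As a sketch of the f(argument1, argument2, ...) pattern with built-in functions (the numbers are arbitrary), arguments can be matched by position or by name:

```r
scores <- c(4, 8, 15, 16, 23, 42)

mean(scores)                   # one argument
round(3.14159, 2)              # two arguments matched by position
round(3.14159, digits = 2)     # same call, second argument named
seq(from = 0, to = 1, by = 0.25)
```

Naming arguments makes a call longer, but it is easier to read and does not depend on remembering the argument order.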
In addition to built‑in functions, R allows us to write our own functions. User‑defined functions are especially useful when we want to bundle a sequence of operations into a single reusable step. For example:
## [1] 25
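The definition that produced this output is not shown. A function like the following, with the hypothetical name square(), would return 25 for an input of 5:

```r
# Define a function that squares its input
square <- function(x) {
  x^2
}

square(5)
square(1:4)    # works element by element on a vector
```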
Whether built‑in or user‑written, functions help us structure our work, avoid repetition, and make our code easier to read and maintain. We’ll discuss functions in detail in the chapter Functions and learn how to write our own functions.
3.6 Package
Packages are the primary way R is extended beyond its core capabilities.
A package is a collection of related functions, documentation, and data bundled together so that it can be shared and reused. In R, packages are the fundamental unit of distributable code.
The design philosophy behind R is to build smaller, specialized tools that each do one thing well, instead of large programs that do everything. As a result, we’ll find packages dedicated to graphics, statistical methods, machine learning, text analysis, and countless other tasks.
When we install R, we automatically receive the base packages. These packages contain the essential functions that allow R to work, and are loaded by default.
Beyond the base system, thousands of additional packages are available from repositories such as CRAN. These add‑on packages provide specialized methods, or accompany textbooks and research projects. We can also create our own packages to share code or to organize our work in a reusable form.
In the first half of this bootcamp, we will focus on the base R packages to understand how the language works. Once we move on to more applied work, we will turn to add-on packages for tasks like data manipulation, visualization, and collecting web data.
add-on packages
Add-on packages are not included in the default installation of R but can be installed separately to extend the functionality of the language. They are typically created by third-party developers and hosted on CRAN or other repositories.
For instance, tidyquant and quantmod both support financial data analysis. But neither comes with base R. They are add-on packages that R users have developed to accomplish very specific tasks.
Tips for choosing packages:
- Check maintenance status. Look at the package’s last update date on CRAN or its GitHub repository.
- See what your community uses. Popularity can be gauged through colleagues, online forums, or sites like RDocumentation.
package documentation
On a package’s CRAN landing page, we’ll find its official documentation. Under the section Documentation, we’ll find Reference manual and Vignettes.
A reference manual is a definitive guide where each function of a package is documented, written by the developers.
A vignette is framed around a target problem that the package is designed to solve. For instance, the package ggplot2 provides a number of vignettes, including Introduction to ggplot2, Using ggplot2 in packages, and Aesthetic specifications, each dealing with a specific task.
Under the section Downloads, we’ll find Package source and Old sources. Old sources is an archive where we can download earlier versions of the package.
installing packages
Users can install packages from multiple places, including CRAN, GitHub, Bitbucket, Bioconductor (genomics), and R-Forge.
To install packages from CRAN, we use the function install.packages(). We need to be connected to the Internet, and the package name must be in quotes.
install.packages("tidyverse")
GitHub is where much of the open-source development of R packages takes place. From GitHub, we can install development versions of packages that have a stable version on CRAN as well as packages not submitted to CRAN yet.
We need to install the package remotes from CRAN before using it.
install.packages("remotes")
remotes::install_github("tidyverse/dplyr")
pak is a newer alternative for installing R packages. It installs packages from CRAN, Bioconductor, GitHub, URLs, git repositories, and local files, with faster parallel downloads, a dependency conflict solver, and disk caching.
R-universe hosts personal and organizational package repositories in a CRAN-like format, making it easy to install packages that may not be on CRAN yet.
We can install different versions of packages. For instance, if we need to install an older version of a package so that it works in an earlier version of R, we can download the package from its CRAN archive.
path_to_file <- "https://cran.r-project.org/src/contrib/Archive/nanotime/nanotime_0.3.2.tar.gz"
install.packages(path_to_file, repos = NULL, type = "source")
updating packages
To view all installed packages, use the function library() with no arguments. It prints a list of installed packages in a new window.
To update the installed packages, use update.packages().
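Inside a script, where a listing in a separate window is less convenient, the same information can be retrieved as data. A sketch using installed.packages() and packageVersion(), both from the utils package that ships with R, with stats chosen only as an example:

```r
# Matrix with one row per installed package
pkgs <- installed.packages()
head(rownames(pkgs))           # first few package names
pkgs["stats", "Version"]       # version of one package, as text

# Version of a single package as a comparable object
packageVersion("stats")

# old.packages() would list packages with newer versions on the
# repository; it needs an Internet connection, so it is not run here.
# old.packages()
```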
loading packages
Before using a package’s functions, we must load it into memory with library().
library(tidyverse)
Libraries are directories containing installed packages. As end users of R, we typically interact with installed packages that live in libraries.
To see which packages are currently loaded, use search() with no arguments.
To unload a package that is currently loaded, use detach().
detach(package:tidyverse)
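The same steps can be tried without installing anything by using splines, a package that ships with R but is not attached at startup. The choice of package is only for illustration.

```r
"package:splines" %in% search()   # typically FALSE in a fresh session

library(splines)                  # attach the package
"package:splines" %in% search()   # TRUE

detach(package:splines)           # detach it again
"package:splines" %in% search()   # FALSE

# A single function can also be called without attaching its package
basis <- splines::bs(1:10, df = 4)
dim(basis)
```

The package::function form is useful when only one or two functions from a package are needed, or when two loaded packages define functions with the same name.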
3.7 Getting help
help() is the primary interface to R’s help systems. It displays the documentation for a function. ? is the shortcut for help().
help(mean)
?mean
For reserved words and for operators made up of special characters, the argument must be enclosed in double or single quotes.
help("if")
help("function")
help("[[")
example() runs an Examples section from the online help.
##
## median> median(1:4) # = 2.5 [even number]
## [1] 2.5
##
## median> median(c(1:3, 100, 1000)) # = 3 [odd, robust]
## [1] 3
help.search() searches all the installed packages to find help pages on a vague topic.
help.search("state space")
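Two related tools are sketched below: ?? is the shortcut for help.search(), and apropos() searches the names of available objects rather than the help pages. The search pattern is arbitrary.

```r
# Shortcut for help.search("state space"); left as a comment
# because it opens the search results viewer
# ??"state space"

# Find objects whose names start with "med"
hits <- apropos("^med")
hits
```

apropos() helps when we remember part of a function's name, while help.search() helps when we only know the topic.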
CRAN Task Views
If we simply have a general interest in a topic, CRAN Task Views provide guidance on which R packages on CRAN are relevant to it. Each task view gives a brief overview of the included packages and links to them.
Just to give an idea of the level and range of the topics, the tasks include Bayesian Inference, Causal Inference, Databases with R, Empirical Finance, Natural Language Processing, Web Technologies and Services, and many more.
Stack Overflow
For troubleshooting, a good place to ask questions is the forum Stack Overflow, which is a searchable Q&A site oriented toward programming issues.
search engines and chat bots
Google has long been our best friend for finding answers to specific questions or tasks. It still plays that role well.
But now we have additional allies. Generative AI tools, often in the form of chat‑based assistants, have become particularly helpful as our personalized learning and coding assistants. They can explain code snippets, suggest improvements, help debug issues, translate code between languages, or generate starter examples for new projects.