When you're dealing with data that you want to explore, analyze, and visualize, you have a lot of options. Most programming languages have packages and libraries that add the capabilities you might need, but there's one ecosystem that stands out: R.
Built over the course of the last 28 years, R is a programming language made for statistical computing and data visualization. It's taught in academia, where I picked it up in my statistics course. All of these influences have shaped R to have some special properties, which might seem foreign coming from other language ecosystems.
In this guide, I want to give a thorough introduction to R, based on my experience as a student and software engineer. There are many great resources, some of which I'll refer to in the following.
Why R?
I think it depends on your use case. Most languages either offer built-in functionality to work with your data or libraries to complement the language base. If your calculations are simple, or you really want to work in the browser environment, R won't be a good fit.
If you can export the data you want to work with as CSV and just want to start exploring and visualize along the way, R is perfect. If you need to create regression models to forecast trends or apply statistical tests, R is perfect. The more in-depth it gets, really, R is perfect.
Another reason for choosing R is its stable and mature language and ecosystem. It's been here for decades and it's still actively used by a huge community of researchers, data scientists, statisticians, and more.
Setting up the environment
First, you'll have to install R (brew install r with Homebrew, or the installer for other platforms). Once that is done, there are multiple ways to work with R. For the following section, we'll use the interactive prompt, which is similar to REPL environments of other languages you might know.
For more complex projects, though, I'd heavily recommend going with an integrated development environment (IDE). If you do not have existing preferences, RStudio is a great open-source tool. I'll use the R Language plugin for IntelliJ, which is supported in every IntelliJ-like environment.
Using integrated environments makes it much easier to write R, manage dependencies, and preview and output graphics.
For now, let's open a terminal and run
❯ R
R version 4.1.1 (2021-08-10) -- "Kick Things"
...
>
With this set up, we've got everything we need for experimentation.
To exit this environment later on, we can write q(), which will offer to save the current R session. You can either confirm with y, or quit without saving with n, then confirm by pressing return.
Basics
While the command-line interface is quite minimalistic, the workflow will usually be similar, even if you work with an IDE. Working with R is different from what you might know from other languages. Usually, you start by retrieving the data source you'll work with, then perform some operations on it (cleanup, analysis, etc.), and finally visualize it. This all happens within one session.
If you need utilities that are not contained in the language base, you can extend your project by installing packages from a repository. We'll explore packages later on.
Syntax
Here are a couple of important notes on language syntax.
- R is case sensitive and allows names to contain alphanumeric characters, as well as dots . and underscores _.
- Commands can be separated with semicolons ; or by a newline.
- Comments are written starting with a #
- Assignments are written as name <- value. Alternatively, name = value works in most contexts too, but in most cases, you'll see the first variant being used.
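To make these rules a bit more concrete, here's a minimal sketch you can paste into the prompt. All names in it (total, Total, my.value, my_value, a, b, d) are made up for illustration.

```r
# Names are case sensitive: total and Total are two different objects.
total <- 1
Total <- 2

# Dots and underscores are both allowed inside names.
my.value <- 3
my_value <- 4

# Two commands on one line, separated by a semicolon.
a <- 10; b <- 20

# = also assigns here, although <- is the more common style.
d = a + b

print(c(total, Total, my.value, my_value, d))
# [1]  1  2  3  4 30
```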
With that, let's try some simple things.
Vectors and Vector Operations
The simplest structure R provides is the vector, an ordered collection of values of the same type, such as numbers. A single number is also a vector with length 1.
Entering an arbitrary number and pressing return shows you the value of your calculation, as a vector of length 1. Just entering a number doesn't make much sense, of course.
> 1
[1] 1
To create a vector with more than one item, you can use the c(...) function, which will combine all arguments passed into a vector.
> c(1,2)
[1] 1 2
R supports a range of basic arithmetic operations on vectors
# Simple operations include +,-,/,*,^
> 1+1
[1] 2
This also works when your vector has multiple items
> c(1,2) * 2
[1] 2 4
Or even when you have two vectors, which will be calculated element by element.
If the vectors are not the same length, the result's length will match the longest vector, and the shorter vector will be recycled until it matches that length.
Constants are just repeated for each element, as we saw in the example above.
> c(1,2) * c(3,4) # equals c(1*3,2*4)
[1] 3 8
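The recycling rule is easier to see with vectors of different lengths. In this small sketch (long, short, and result are just illustrative names), the shorter vector is reused from its start once it runs out of elements.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 100)

# short is recycled to c(10, 100, 10, 100) before multiplying.
result <- long * short
print(result)
# [1]  10 200  30 400
```

If the longer length is not a multiple of the shorter one, R still recycles but prints a warning, which is usually a hint that the vectors were not meant to be combined.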
In addition to the operators, you can use the functions log, exp, sin, cos, tan, sqrt, min, max, etc.
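These functions come in two flavors, which a quick sketch can show. The vector values and the other names below are only examples I picked for this purpose.

```r
values <- c(1, 4, 9, 16)

# sqrt is applied to each element and returns a vector of the same length.
roots <- sqrt(values)
print(roots)
# [1] 1 2 3 4

# min and max reduce the whole vector to a single value.
smallest <- min(values)
largest <- max(values)
print(c(smallest, largest))
# [1]  1 16
```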
If you want to add up all elements of a vector or get the product of all values, you can use sum and prod
> sum(c(1,2))
[1] 3
> prod(c(2,3))
[1] 6
If you want to know how many elements are contained in a vector, use length
> length(c(3,5,6,2,5))
[1] 5
Variables
Now that we have a basic understanding of what we can do with vectors, it would be nice to store our results. For this, we can assign our values to a name, which is persisted.
> x <- 5
> y <- x*2
> y
[1] 10
With this, you can get creative
> x <- 5
> c(x,c(1,2))
[1] 5 1 2
Missing Value
If you run into a case where a value is unknown or missing, you can use NA (not available). Any operation on a missing (NA) value becomes NA.
> c(1,NA,3)
[1] 1 NA 3
> c(1,NA,3) * 2
[1] 2 NA 6
In addition to NA, there is another missing value, which results from illegal operations: NaN (not a number).
> 0/0
[1] NaN
> NaN*2
[1] NaN
To detect missing values, you can use is.na(x) (true for both NA and NaN) and is.nan(x) (only true for NaN).
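Here's a short sketch of how the two checks differ on the same vector. The names are illustrative, and the na.rm = TRUE argument at the end is something I'm adding on top of the text above: it's an option that sum and several other summary functions accept to skip missing values.

```r
measurements <- c(1, NA, 3, 0 / 0)   # 0 / 0 produces NaN

# is.na is TRUE for both NA and NaN.
missing <- is.na(measurements)
print(missing)
# [1] FALSE  TRUE FALSE  TRUE

# is.nan is TRUE only for NaN.
not_a_number <- is.nan(measurements)
print(not_a_number)
# [1] FALSE FALSE FALSE  TRUE

# Without na.rm the sum would be NA; with it, missing values are skipped.
total <- sum(measurements, na.rm = TRUE)
print(total)
# [1] 4
```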
Sequences
You can create numeric sequences using the colon syntax
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
If you need more granular control over generating the sequence, use the seq function
> seq(1,10) # identical to 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,10, by = 3)
[1] 1 4 7 10
> seq(1,5, by = .5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
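Two more variations of seq that I find handy, sketched here with made-up names: a negative by counts down, and length.out (another argument of seq, not shown above) asks for a fixed number of evenly spaced values instead of a step size.

```r
# Count down by using a negative step.
countdown <- seq(10, 1, by = -3)
print(countdown)
# [1] 10  7  4  1

# Ask for five evenly spaced values between 0 and 1.
quarters <- seq(0, 1, length.out = 5)
print(quarters)
# [1] 0.00 0.25 0.50 0.75 1.00
```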
Logical Vectors
In addition to regular vector operations with numerical vectors, you can create logical vectors with conditions.
> x <- 5
> y <- x > 3
> y
[1] TRUE
As with numerical vectors, you can add conditions to a vector with multiple elements, resulting in a vector of results of the condition applied to each element.
> x <- c(2,4) > 3
> x
[1] FALSE TRUE
Index Vectors
When you want to work with subsets of elements of a vector, you can select specific values by appending an index vector in square brackets
> x <- c(1,2,3,4,5)
> x[1]
[1] 1
> x[1:3]
[1] 1 2 3
Not only can you use an integer value as an index vector, but also a logical vector (i.e. a vector of TRUE/FALSE elements), which will be recycled in case the vector from which elements are to be selected is longer. Values corresponding to TRUE in the index vector will be selected, all others are omitted.
> x <- c(1,2,3,4,5)
# Keep elements >= 3
> x[x >= 3]
[1] 3 4 5
# Keep even numbers
> x[x %% 2 == 0]
[1] 2 4
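The recycling of a logical index vector is easy to miss, so here's a small sketch of it, along with a combined condition. The names every_other and even_and_large are only for illustration.

```r
x <- c(1, 2, 3, 4, 5)

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE.
every_other <- x[c(TRUE, FALSE)]
print(every_other)
# [1] 1 3 5

# Conditions can be combined element by element with & (and) and | (or).
even_and_large <- x[x %% 2 == 0 & x >= 3]
print(even_and_large)
# [1] 4
```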
Functions
In case you keep repeating a specific piece of logic, you might want to move that code into a reusable function. A function receives a list of arguments (or parameters) and returns a value, either implicitly (the last evaluated value) or explicitly with return(value).
> percChange <- function (old,new) { (new-old)/(old) }
> percChange(10,15) # same as percChange(old = 10, new = 15)
[1] 0.5
> percChange(15,10)
[1] -0.3333333
When invoking a function, argument names can be omitted or explicitly included for readability. When passing named arguments, the order becomes irrelevant.
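As a quick sketch of both points: named arguments can be passed in any order, and return() lets a function hand back a value early. percChangeExplicit and its handling of old == 0 are my own assumptions for this example, not something R prescribes.

```r
percChange <- function(old, new) { (new - old) / old }

# Positional and named calls give the same result.
positional <- percChange(10, 15)
named <- percChange(new = 15, old = 10)
print(c(positional, named))
# [1] 0.5 0.5

# Hypothetical variant that returns early instead of dividing by zero.
percChangeExplicit <- function(old, new) {
  if (old == 0) {
    return(NA)
  }
  (new - old) / old
}
print(percChangeExplicit(0, 15))
# [1] NA
```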
If you have arguments that your function will simply pass on, you can use the ... syntax
doSomething <- function(x, y, ...) {
  z <- x + y
  otherFunction(z, ...)
}
In this example, we declare a function that receives the named arguments x and y used for internal computation, and an additional argument ... that collects all values passed additionally.
doSomething(x, y) # ... will be empty
doSomething(x, y, z) # ... will include z
If you want to find out what was passed in ..., you can call list(...) to get a list of all additional arguments, with names for those that were passed by name.
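Since doSomething relies on a placeholder otherFunction, here's a runnable sketch of the same pattern. roundedSum and countExtras are hypothetical helpers: the first forwards ... to the built-in round, the second inspects ... with list(...).

```r
# Forward any extra arguments to round() without listing them ourselves.
roundedSum <- function(x, y, ...) {
  round(x + y, ...)
}
print(roundedSum(1.234, 2.345))
# [1] 4
print(roundedSum(1.234, 2.345, digits = 1))
# [1] 3.6

# Collect the extra arguments into a list and count them.
countExtras <- function(x, y, ...) {
  extras <- list(...)
  length(extras)
}
print(countExtras(1, 2))
# [1] 0
print(countExtras(1, 2, 3, z = 4))
# [1] 2
```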
Binary Operators
A syntactic convenience for operations is to declare functions as binary operators, in the form %x%, where x is the name of the operator. Let's implement one to see what would happen
> "%product%" <- function(x,y) { x * y }
> 5 %product% 2
[1] 10
Note the double quotes when declaring the variable name the function is assigned to.
Control Flow
In functions, you might want to run some logic conditionally. This can be done using if statements. Let's implement a simple function, the absolute value, which removes the sign of negative numerical values. This is included as abs, but we'll demonstrate how you could implement it manually in the following
absoluteValue <- function(x) {
  if (x < 0) {
    -x
  } else {
    x
  }
}
> absoluteValue(-3)
[1] 3
> absoluteValue(3)
[1] 3
You can also use the vectorized version ifelse:
absoluteValue <- function(x) { ifelse(x < 0, -x, x) }
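The difference shows up once you pass more than one value. An if statement expects a single TRUE or FALSE (depending on your R version, a longer condition produces a warning or an error), while ifelse checks each element. A minimal sketch, with result as an illustrative name:

```r
absoluteValue <- function(x) { ifelse(x < 0, -x, x) }

# The condition is evaluated for every element of the vector.
result <- absoluteValue(c(-3, 0, 2))
print(result)
# [1] 3 0 2
```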
Objects
All entities R manages are objects, no matter whether you're working with numbers, strings, or other structures. These objects will be stored in your session by name, to be reused. You can list all objects by typing objects().
You can remove existing objects from your session by passing their name to rm()
# At the beginning of our session, no objects are stored
> objects()
character(0)
# When we assign a character string to a variable name,
# the resulting variable is stored in our session
> hello <- "world"
> objects()
[1] "hello"
# We can now access this variable by its name
> hello
[1] "world"
# And remove it from the session with rm(...)
> rm(hello)
# And it's gone
> objects()
character(0)
Lists
Lists represent an ordered collection of objects, referred to as components.
Components can be of different types, and are always numbered. They can be named, which makes it easier to retrieve data with the $ operator.
> user <- list(id = 1, name = "Bruno", favorite.color="#0200FF")
> user[[1]]
[1] 1
> user[[2]]
[1] "Bruno"
> user$name
[1] "Bruno"
> user[["name"]]
[1] "Bruno"
When retrieving component values by name using the $ operator, you can even abbreviate the name, as long as the abbreviation matches only one component.
> sample <- list(is_verified=TRUE, is_admin=FALSE)
> sample$is_a # same as sample$is_admin
[1] FALSE
> sample$is_v
[1] TRUE
Data Frames
Data frames are extensions of lists, which you can think of as a matrix with columns possibly of differing modes and attributes.
Usually, you do not manually construct data frames but load them from data sources (CSVs, etc.). For this example, we'll create a frame with two columns and two rows of values.
> users <- data.frame(id = c(1,2), name = c("Ada", "Bob"))
> users
  id name
1  1  Ada
2  2  Bob
> users$id
[1] 1 2
> users$name
[1] "Ada" "Bob"
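Since data frames usually come from files, here's a rough sketch of that path. To keep it self-contained, I'm passing the CSV content as a string through the text argument of read.csv; with a real file you would pass its path instead. The data and the names csv_text and later_users are invented for this example.

```r
# CSV content as it might appear in a file, with a header row.
csv_text <- "id,name
1,Ada
2,Bob
3,Cleo"

# Parse the text into a data frame; keep names as plain character strings.
users <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(nrow(users))
# [1] 3

# Columns are vectors, so logical index vectors work on them as usual.
later_users <- users$name[users$id >= 2]
print(later_users)
# [1] "Bob"  "Cleo"
```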
Packages
Packages are not libraries: Libraries contain a set of packages.
To extend the functions or data you can use within R, you can install packages from a repository, usually CRAN (the Comprehensive R Archive Network).
As an example, let's install the package MASS to get access to a data set included
> install.packages("MASS")
...
> library(MASS)
To view all available datasets in the current environment, run data()
Data sets in package ‘datasets’:
...
Data sets in package ‘MASS’:
...
phones    Belgium Phone Calls 1950-1973
...
Let's explore the phones dataset!
> phones
$year
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
$calls
[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0 13.5 14.9
[13] 16.1 21.2 119.0 124.0 142.0 159.0 182.0 212.0 43.0 24.0 27.0 29.0
Resources
- R Intro: Introduction to the R language, different data types, and more
- Awesome R: List of packages and tools for all use cases
Thanks for reading! This post deals with the language fundamentals you should be aware of, to read and write R code. Up next, I'll continue with graphics and other applications where R is incredibly useful.