Think about data structures as different kinds of containers for storing data values.
We’ve met vectors already. Vectors are the most important type of object in R. Vectors contain a single type of value: numbers, strings, or logical values. But there are several others that are more complicated than vectors. Each defines how an object is stored in R.
To begin with, factors. Think about factors as vectors with categorical labels.
Then, matrices and arrays. A matrix is an extension of a vector to two dimensions. An array is a multidimensional vector.
Next, we have lists. Lists are a general form of vector in which the various elements need not be of the same type. Lists can contain other objects, such as vectors, lists, and data frames.
Finally, for this post (but not for R), data frames. Data frames are matrix-like structures, in which the columns can be of different types. Think about data frames as “data matrices” with one row per observational unit.
The data structures R operates on are called objects. Common types of objects include vectors, factors, arrays, matrices, lists, data frames, and functions. Below we’ll go through each type of data structure.
A vector holds an ordered sequence of values, and it can only contain elements of the same class.
The vector is the most important object in R because much of R is “vectorized”: a function works on a whole vector at once, so there is no need to loop over its values.
Vectors of different lengths can appear in the same expression, but the shorter one is recycled until it matches the length of the longer one. R issues a warning if the length of the longer object is not a multiple of the length of the shorter object.
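As a brief illustration of vectorization and recycling, here is a minimal sketch. The vector `prices` and the other variable names are made up for this example and do not come from the original post.

```r
# Hypothetical values, chosen only to illustrate the rules described above
prices <- c(1, 2, 3, 4, 5, 6)

# Vectorized: the multiplication applies to every element, no loop needed
doubled <- prices * 2            # 2 4 6 8 10 12

# Recycling: the length-2 vector is reused three times (10 20 10 20 10 20)
shifted <- prices + c(10, 20)    # 11 22 13 24 15 26

# Length 6 is not a multiple of length 4, so R still recycles but warns;
# tryCatch() captures the warning text instead of printing it
msg <- tryCatch(prices + c(1, 2, 3, 4),
                warning = function(w) conditionMessage(w))
msg
```

Capturing the warning with `tryCatch()` is only a convenience for the example; in an interactive session the warning is simply printed after the result.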
There are several ways to generate a vector, including the : operator and the functions c(), seq(), rep(), and paste().
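A quick sketch of each of these, with arbitrary example values (the variable names are illustrative only):

```r
ints     <- 1:5                            # 1 2 3 4 5
combined <- c(2, 4, 6)                     # 2 4 6
steps    <- seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
repeated <- rep(c("a", "b"), times = 2)    # "a" "b" "a" "b"
labels   <- paste("item", 1:3, sep = "_")  # "item_1" "item_2" "item_3"
```

Note that paste() is itself vectorized: the single string "item" is recycled across the three numbers.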
Read the full post on vectors.
A factor is a vector used to specify a discrete classification (grouping) of the components of other vectors of the same length. (W. N. Venables, D. M. Smith and the R Core Team. (2021). An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics.)
We use factors to represent a categorical variable (e.g. in linear regression, logistic regression) and to label data items according to their group.
We can create a factor with the function factor().
flavor <- c("chocolate", "vanilla", "strawberry", "mint",
"coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)
flavor_f
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla
## [8] pistachio
## Levels: chocolate coffee mint pistachio strawberry vanilla
A factor has an attribute called levels. Levels are the different values that a factor can take.
attributes(flavor_f)
## $levels
## [1] "chocolate" "coffee" "mint" "pistachio" "strawberry"
## [6] "vanilla"
##
## $class
## [1] "factor"
Use levels() to get the levels of a factor.
levels(flavor_f)
## [1] "chocolate" "coffee" "mint" "pistachio" "strawberry"
## [6] "vanilla"
nlevels() returns the number of levels of a factor.
nlevels(flavor_f)
## [1] 6
We can manually set the order of the levels. Use the levels argument of factor() to specify the levels, and the ordered argument to indicate whether the levels should be treated as ordered in the order given. By default, the levels are stored in alphabetical order.
factor(flavor)
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla
## [8] pistachio
## Levels: chocolate coffee mint pistachio strawberry vanilla
factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"))
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla
## [8] pistachio
## Levels: strawberry vanilla chocolate coffee mint pistachio
factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"),
ordered = TRUE)
## [1] chocolate vanilla strawberry mint coffee strawberry vanilla
## [8] pistachio
## Levels: strawberry < vanilla < chocolate < coffee < mint < pistachio
A more meaningful example is when the order actually matters. For example, we conducted a survey and asked respondents how they felt about the statement “sweet rice dumplings are better than salty dumplings.” Respondents gave one of the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.
survey_results <- factor(
c("Disagree", "Neutral", "Strongly Disagree",
"Neutral", "Agree", "Strongly Agree",
"Disagree", "Strongly Agree", "Neutral",
"Strongly Disagree", "Neutral", "Agree"),
levels = c("Strongly Disagree", "Disagree",
"Neutral", "Agree", "Strongly Agree"),
ordered = TRUE)
survey_results
## [1] Disagree Neutral Strongly Disagree Neutral
## [5] Agree Strongly Agree Disagree Strongly Agree
## [9] Neutral Strongly Disagree Neutral Agree
## 5 Levels: Strongly Disagree < Disagree < Neutral < ... < Strongly Agree
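To see what the ordering buys us, here is a small sketch on a shortened, made-up set of answers (the names `scale_levels` and `answers` are illustrative, not from the survey above). With an ordered factor, comparison operators and max() respect the level order, and table() reports counts in that order.

```r
# Hypothetical answers on the same five-point scale
scale_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                  "Agree", "Strongly Agree")
answers <- factor(c("Disagree", "Agree", "Neutral", "Strongly Agree"),
                  levels = scale_levels, ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
at_least_neutral <- answers >= "Neutral"   # FALSE TRUE TRUE TRUE

# max() is defined for ordered factors
highest <- max(answers)                    # Strongly Agree

# Counts are listed in level order, including levels that never occur
counts <- table(answers)
counts
```

On an unordered factor the same comparison is not meaningful, and R returns NA with a warning.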
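Ordered factors are useful in regressions that include ordinal categorical variables. For an ordered factor, lm() uses polynomial contrasts by default, which is why the coefficients below are labeled .L (linear), .Q (quadratic), .C (cubic), and so on.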
library(dplyr)
storms <- storms
class(storms$category)
## [1] "ordered" "factor"
levels(storms$category)
## [1] "-1" "0" "1" "2" "3" "4" "5"
summary(lm(wind ~ category, data = storms))
##
## Call:
## lm(formula = wind ~ category, data = storms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2692 -5.8004 -0.8004 4.0890 14.9265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.3832 0.1399 617.486 < 2e-16 ***
## category.L 101.7947 0.4737 214.875 < 2e-16 ***
## category.Q -2.4750 0.4622 -5.355 8.76e-08 ***
## category.C 3.3975 0.3858 8.807 < 2e-16 ***
## category^4 4.6516 0.3063 15.188 < 2e-16 ***
## category^5 -1.8054 0.2706 -6.672 2.66e-11 ***
## category^6 0.4220 0.2608 1.618 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.437 on 10003 degrees of freedom
## Multiple R-squared: 0.9397, Adjusted R-squared: 0.9397
## F-statistic: 2.6e+04 on 6 and 10003 DF, p-value: < 2.2e-16
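Factors are also useful in graphs, because the order of the levels determines the order in which categories appear in the plot and its legend.
library(ggplot2)
# dtset (not shown here) is a data frame with the columns Reason, Total, and Level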
ggplot(dtset, aes(x = reorder(Reason, Total), y = Total, fill = factor(Level, levels = c("High","Medium","Low")))) +
geom_bar(stat = "identity", alpha = 0.75) +
scale_fill_manual(values = c("#765285", "#709FB0", "#D1A827"), name="Level of\nFrequency") +
coord_flip() +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12, margin = margin(0,3,0,0)),
axis.title.y = element_blank(),
axis.title.x = element_text(size = 12, margin = margin(15,0,0,0)),
axis.ticks.x = element_line(size = 0),
legend.title = element_text(size = 12),
legend.text = element_text(size = 12),
plot.margin = unit(c(0,0,1,0), "cm"))
A factor can be converted to character using as.character().
f <- factor(c("chocolate", "vanilla", "strawberry"))
f2 <- as.character(f)
class(f2)
## [1] "character"
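A factor can be converted to numeric using as.numeric(as.character()). Calling as.numeric() directly on a factor returns the internal integer codes rather than the values shown as labels.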
category <- storms$category
category2 <- as.numeric(as.character(category))
class(category2)
## [1] "numeric"
Or through as.numeric(levels())[].
category3 <- as.numeric(levels(category))[category]
class(category3)
## [1] "numeric"
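The reason for the detour through as.character() or levels() is that a factor stores integer codes internally. A minimal sketch with a made-up factor `f` (not from the storms data) shows the difference:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "5", "20", "5"))
levels(f)                       # "10" "20" "5" (sorted as strings)

# Direct conversion returns the internal codes, not the labels
codes <- as.numeric(f)          # 1 3 2 3

# Both idioms recover the intended numbers
via_character <- as.numeric(as.character(f))   # 10 5 20 5
via_levels    <- as.numeric(levels(f))[f]      # 10 5 20 5
```

The second idiom converts each level only once and then indexes by the codes, so it does less work on long factors with few levels.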
An array is a multidimensional vector. A matrix is a special type of array that has two dimensions.
A matrix is an extension of a vector to two dimensions. Just to show what that means:
a <- 1:6
dim(a) #initially NULL
## NULL
dim(a) <- c(2,3)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
In practice, we use the matrix() function to create a matrix and specify the number of rows and columns.
matrix(data = 1:6, nrow = 2, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Note: A matrix stores data of a single type.
By default, the data fill the matrix column by column. Set byrow = TRUE to fill it row by row instead.
matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
We can give the row and column names by specifying the dimnames argument.
matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("r1","r2"), c("c1","c2","c3")))
## c1 c2 c3
## r1 1 2 3
## r2 4 5 6
We refer to part of a matrix using the indexing operator [] that we’ve seen before.
a[2,2] #second row and second column
## [1] 4
a[1:2,1:2] #first two rows and first two columns
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
a[1,] #first row only
## [1] 1 3 5
a[,1] #first column only
## [1] 1 2
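Two related details are worth a short sketch, both additions to the examples above rather than part of the original post: a matrix with dimnames can also be indexed by name, and selecting a single row or column drops the result to a plain vector unless drop = FALSE is given. The matrix `m` below is illustrative.

```r
# A 2 x 3 matrix, filled by column, with row and column names
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Index by name instead of position
m["r2", "c3"]                      # 6

# A single row comes back as a vector (the dim attribute is dropped)
row_vec <- m[1, ]
is.matrix(row_vec)                 # FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix
row_mat <- m[1, , drop = FALSE]
dim(row_mat)                       # 1 3
```

Keeping the matrix shape matters when the result is passed on to code that expects two dimensions, such as matrix multiplication.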
cbind() and rbind() combine matrices by binding columns and rows, respectively.
m1 <- matrix(1:9, ncol = 3, nrow = 3)
m2 <- matrix(10:12, ncol =1, nrow = 3)
m3 <- matrix(10:12, ncol = 3, nrow = 1)
m1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
m2
## [,1]
## [1,] 10
## [2,] 11
## [3,] 12
m3
## [,1] [,2] [,3]
## [1,] 10 11 12
cbind(m1, m2)
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
rbind(m1, m3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [4,] 10 11 12
R offers rich matrix operators.
For instance, matrix addition:
A <- matrix(c(1:12), 3, 4)
B <- matrix(c(13:24), 3, 4)
A + B
## [,1] [,2] [,3] [,4]
## [1,] 14 20 26 32
## [2,] 16 22 28 34
## [3,] 18 24 30 36
Element-wise multiplication (the * operator multiplies corresponding entries; it is not the matrix product):
A * B
## [,1] [,2] [,3] [,4]
## [1,] 13 64 133 220
## [2,] 28 85 160 253
## [3,] 45 108 189 288
Transposition:
t(A)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
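The * operator works entry by entry. For the matrix product in the linear-algebra sense, R provides the %*% operator, which requires the number of columns of the left matrix to equal the number of rows of the right one. A minimal sketch with the same two 3 x 4 matrices (the result names are illustrative):

```r
A <- matrix(1:12, 3, 4)
B <- matrix(13:24, 3, 4)

# Entry-by-entry product: same shape as A and B
elementwise <- A * B
dim(elementwise)                 # 3 4

# Matrix product: (3 x 4) %*% (4 x 3) gives a 3 x 3 matrix
product <- A %*% t(B)
dim(product)                     # 3 3
product[1, 1]                    # 1*13 + 4*16 + 7*19 + 10*22 = 430

# A %*% B fails because a 3 x 4 matrix cannot multiply a 3 x 4 matrix
bad_shapes <- inherits(try(A %*% B, silent = TRUE), "try-error")
bad_shapes                       # TRUE
```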
A matrix is a special, two-dimensional array, and an array is a multidimensional vector. Internally, an array is stored the same way as a vector, with an added dim attribute that records its dimensions.
b <- 1:12
dim(b) <- c(2,3,2)
b
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
A more natural way to create an array is to use the function array().
b <- array(1:12, dim = c(2,3,2))
b
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
attributes(b)
## $dim
## [1] 2 3 2
dim(b)
## [1] 2 3 2
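Indexing an array works like indexing a matrix, with one index per dimension. A short sketch on the same 2 x 3 x 2 array (the name `slice2` is illustrative):

```r
b <- array(1:12, dim = c(2, 3, 2))

# One index per dimension: row, column, layer
b[2, 3, 1]          # 6
b[1, 2, 2]          # 9

# Leaving an index empty selects everything along that dimension;
# the second layer comes back as an ordinary 2 x 3 matrix
slice2 <- b[, , 2]
dim(slice2)         # 2 3
```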
A list is a vector where each element can be of a different data type.
list() creates a list. List components can be named.
book <- list(title = "Nineteen Eighty-Four: A Novel",
author = "George Orwell",
published_year = 1949,
pages = 328)
book
## $title
## [1] "Nineteen Eighty-Four: A Novel"
##
## $author
## [1] "George Orwell"
##
## $published_year
## [1] 1949
##
## $pages
## [1] 328
Lists can be indexed by position or name.
By position. Same as in vector indexing.
book[3]
## $published_year
## [1] 1949
book[-3]
## $title
## [1] "Nineteen Eighty-Four: A Novel"
##
## $author
## [1] "George Orwell"
##
## $pages
## [1] 328
book[[3]]
## [1] 1949
book[c(2,3)]
## $author
## [1] "George Orwell"
##
## $published_year
## [1] 1949
By name, using $ or [[""]]. With $, R accepts partial matching of element names.
book$title
## [1] "Nineteen Eighty-Four: A Novel"
book$t
## [1] "Nineteen Eighty-Four: A Novel"
book[["title"]]
## [1] "Nineteen Eighty-Four: A Novel"
book[c("title", "author")]
## $title
## [1] "Nineteen Eighty-Four: A Novel"
##
## $author
## [1] "George Orwell"
Note: With [], the result of these indexing operations is another list. If we want to access the contents of the list, we should use the double brackets [[]] operator or the dollar sign $ operator for the named components.
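The difference becomes visible as soon as we try to compute with the result. A minimal sketch, using a shortened copy of the book list (the variable names other than `book` are illustrative):

```r
book <- list(title = "Nineteen Eighty-Four: A Novel",
             published_year = 1949)

# Single brackets return a list of length one
as_list <- book["published_year"]
class(as_list)                       # "list"

# Double brackets return the element itself
as_value <- book[["published_year"]]
class(as_value)                      # "numeric"

# Arithmetic works on the element ...
next_year <- as_value + 1            # 1950

# ... but not on the one-element list, which raises an error
failed <- inherits(try(as_list + 1, silent = TRUE), "try-error")
failed                               # TRUE
```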
A list can contain other lists. This makes the list a recursive object in R.
books <- list("this list references another list", book)
books
## [[1]]
## [1] "this list references another list"
##
## [[2]]
## [[2]]$title
## [1] "Nineteen Eighty-Four: A Novel"
##
## [[2]]$author
## [1] "George Orwell"
##
## [[2]]$published_year
## [1] 1949
##
## [[2]]$pages
## [1] 328
To access nested elements, we can stack up the square brackets.
books[[2]][["pages"]]
## [1] 328
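unlist() can be used to flatten a list to a vector. Because a vector holds a single type, elements of different types are coerced to a common one, so the numbers below become character strings.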
unlist(books)
## title
## "this list references another list" "Nineteen Eighty-Four: A Novel"
## author published_year
## "George Orwell" "1949"
## pages
## "328"
A data frame is a list with class data.frame. Data frames are used to store spreadsheet-like data in rows and columns. Each column can store data of a different type, but all columns must have the same length. The columns must have names. The components of a data frame can be vectors, factors, numeric matrices, lists, or other data frames.
Data frames are particularly good for representing observational data.
Data frames can be created with data.frame().
laureate <- c("Bob Dylan", "Mo Yan", "Ernest Hemingway", "Winston Churchill", "Bertrand Russell")
year <- c(2016, 2012, 1954, 1953, 1950)
country <- c("United States", "China", "United States", "United Kingdom", "United Kingdom")
genre <- c("poetry, songwriting", "novel, short story", "novel, short story, screenplay", "history, essay, memoirs", "philosophy")
nobel_prize_literature <- data.frame(laureate, year, country, genre)
nobel_prize_literature
## laureate year country genre
## 1 Bob Dylan 2016 United States poetry, songwriting
## 2 Mo Yan 2012 China novel, short story
## 3 Ernest Hemingway 1954 United States novel, short story, screenplay
## 4 Winston Churchill 1953 United Kingdom history, essay, memoirs
## 5 Bertrand Russell 1950 United Kingdom philosophy
Note: A data frame is not a matrix; it is a list interpreted as a data frame.
mode(nobel_prize_literature)
## [1] "list"
class(nobel_prize_literature)
## [1] "data.frame"
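A few quick checks make the "list of columns" view concrete. This is a sketch on a two-row, two-column excerpt of the table above; the name `prizes` is illustrative, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
prizes <- data.frame(laureate = c("Bob Dylan", "Mo Yan"),
                     year = c(2016, 2012),
                     stringsAsFactors = FALSE)

is.list(prizes)      # TRUE: a data frame is a list ...
is.matrix(prizes)    # FALSE: ... not a matrix

# As a list, its length is the number of columns
length(prizes)       # 2

# Each column keeps its own type
col_types <- sapply(prizes, class)
col_types            # laureate: "character", year: "numeric"
```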
We can refer to the components of a data frame by name using the list operators $ or [[]].
nobel_prize_literature$laureate
## [1] "Bob Dylan" "Mo Yan" "Ernest Hemingway"
## [4] "Winston Churchill" "Bertrand Russell"
nobel_prize_literature[["laureate"]]
## [1] "Bob Dylan" "Mo Yan" "Ernest Hemingway"
## [4] "Winston Churchill" "Bertrand Russell"
Or we can use matrix-like notation, with a row index and a column index.
nobel_prize_literature[1,]
## laureate year country genre
## 1 Bob Dylan 2016 United States poetry, songwriting
Logical conditions are also allowed as indices, and in practice they are used frequently.
nobel_prize_literature$laureate[nobel_prize_literature$country == "United Kingdom"]
## [1] "Winston Churchill" "Bertrand Russell"
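The same idea extends to whole rows. The sketch below rebuilds part of the table so that it runs on its own (the names `prizes`, `recent`, and `uk` are illustrative) and filters rows with a logical condition, first with matrix-like notation and then with the base R function subset().

```r
prizes <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway",
               "Winston Churchill", "Bertrand Russell"),
  year = c(2016, 2012, 1954, 1953, 1950),
  country = c("United States", "China", "United States",
              "United Kingdom", "United Kingdom"),
  stringsAsFactors = FALSE
)

# Rows where the condition is TRUE, and only two of the columns
recent <- prizes[prizes$year > 2000, c("laureate", "year")]
recent$laureate      # "Bob Dylan" "Mo Yan"

# subset() expresses the same kind of filter without repeating prizes$
uk <- subset(prizes, country == "United Kingdom", select = laureate)
uk$laureate          # "Winston Churchill" "Bertrand Russell"
```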
Data frame attributes can be inspected with rownames() for row names, colnames() for column names, dimnames() for dimension names, nrow() for the number of rows, ncol() for the number of columns, dim() for the dimensions, etc.
rownames(nobel_prize_literature)
## [1] "1" "2" "3" "4" "5"
colnames(nobel_prize_literature)
## [1] "laureate" "year" "country" "genre"
dimnames(nobel_prize_literature)
## [[1]]
## [1] "1" "2" "3" "4" "5"
##
## [[2]]
## [1] "laureate" "year" "country" "genre"
nrow(nobel_prize_literature)
## [1] 5
ncol(nobel_prize_literature)
## [1] 4
dim(nobel_prize_literature)
## [1] 5 4