Data Structures

Think about data structures as different kinds of containers for storing data values.

We’ve met vectors already. Vectors are the most important type of object in R. Vectors contain a single type of value: numbers, strings, or logical values. But there are several others that are more complicated than vectors. Each defines how an object is stored in R.

To begin with, factors. Think about factors as vectors with categorical labels.

Then, matrices and arrays. A matrix is an extension of a vector to two dimensions. An array is a multidimensional vector.

Next, we have lists. Lists are a general form of vector in which the various elements need not be of the same type. Lists can contain other objects, such as vectors, lists, and data frames.

Finally, for this post (but not for R), data frames. Data frames are matrix-like structures, in which the columns can be of different types. Think about data frames as “data matrices” with one row per observational unit.

The data structures R operates on are called objects. Common types of objects include vectors, factors, arrays, matrices, lists, data frames, and functions. Below we’ll go through each type of data structure.

Vectors

Vectors contain ordered sequences of values, and they can only contain elements of the same class.

The vector is the most important object in R because much of R is “vectorized”. This means that a function works on a whole vector at once, so there is no need to loop over the individual values.
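As a minimal sketch of what vectorization means, the snippet below applies sqrt() to a whole vector and compares the result with an explicit loop. The names x, roots_vectorized and roots_loop are made up for this illustration.

```r
# Vectorized: sqrt() is applied to every element in one call.
x <- c(1, 4, 9, 16)
roots_vectorized <- sqrt(x)

# The equivalent explicit loop, shown only for comparison.
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

roots_vectorized
## [1] 1 2 3 4

# Both approaches give the same values.
identical(roots_vectorized, roots_loop)
## [1] TRUE
```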

Vectors of different lengths can appear in the same expression, but the shorter one will be recycled until it matches the length of the longer one. We will receive a warning if the length of the longer object is not a multiple of the length of the shorter object.
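A small sketch of recycling, using made-up vectors long and short. In the second expression the lengths do not divide evenly; we wrap it in suppressWarnings() here only to keep the output clean, and without the wrapper R would print a warning about the mismatched lengths.

```r
long  <- 1:6
short <- c(10, 20)

# short is recycled three times: 10 20 10 20 10 20
long + short
## [1] 11 22 13 24 15 26

# Length 6 is not a multiple of length 4. R still recycles
# (1 2 3 4 1 2) but normally emits a warning.
result <- suppressWarnings(1:6 + c(1, 2, 3, 4))
result
## [1] 2 4 6 8 6 8
```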

There are several ways to generate vectors and sequences: the : operator and the functions c(), seq(), rep(), and paste().
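The following is a brief illustration of each of these, with variable names chosen only for this example.

```r
ints  <- 1:5                               # integer sequence from 1 to 5
vals  <- c(2, 4, 6)                        # combine individual values
steps <- seq(from = 0, to = 1, by = 0.25)  # sequence with a fixed step
twice <- rep(c("a", "b"), times = 2)       # repeat the whole vector
each2 <- rep(c("a", "b"), each = 2)        # repeat each element in turn
tags  <- paste("item", 1:3, sep = "_")     # build strings element by element

steps
## [1] 0.00 0.25 0.50 0.75 1.00
twice
## [1] "a" "b" "a" "b"
each2
## [1] "a" "a" "b" "b"
tags
## [1] "item_1" "item_2" "item_3"
```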

Read the full post on vectors.

Factors

A factor is a vector used to specify a discrete classification (grouping) of the components of other vectors of the same length. (W. N. Venables, D. M. Smith and the R Core Team. (2021). An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics.)

We use factors to represent a categorical variable (e.g. in linear regression, logistic regression) and to label data items according to their group.

We can create a factor with the function factor().

flavor <- c("chocolate", "vanilla", "strawberry", "mint", 
            "coffee", "strawberry", "vanilla", "pistachio")
flavor_f <- factor(flavor)
flavor_f

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla   
## [8] pistachio 
## Levels: chocolate coffee mint pistachio strawberry vanilla

A factor has an attribute called levels. Levels are the different values that a factor can take.

attributes(flavor_f)

## $levels
## [1] "chocolate"  "coffee"     "mint"       "pistachio"  "strawberry"
## [6] "vanilla"   
## 
## $class
## [1] "factor"

Use levels() to get the levels of a factor.

levels(flavor_f)

## [1] "chocolate"  "coffee"     "mint"       "pistachio"  "strawberry"
## [6] "vanilla"

nlevels() returns the number of levels of a factor.

nlevels(flavor_f)

## [1] 6

We can manually set the order of the levels. Use the levels argument of factor() to specify the levels, and use the ordered argument to indicate whether the levels should be regarded as ordered in the order given. By default, the levels are stored in alphabetical order.

factor(flavor)

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla   
## [8] pistachio 
## Levels: chocolate coffee mint pistachio strawberry vanilla

factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"))

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla   
## [8] pistachio 
## Levels: strawberry vanilla chocolate coffee mint pistachio

factor(flavor, levels = c("strawberry", "vanilla", "chocolate", "coffee", "mint", "pistachio"),
       ordered = TRUE)

## [1] chocolate  vanilla    strawberry mint       coffee     strawberry vanilla   
## [8] pistachio 
## Levels: strawberry < vanilla < chocolate < coffee < mint < pistachio

A more meaningful example is when the order actually matters. For example, we conducted a survey and asked respondents how they felt about the statement “sweet rice dumplings are better than salty dumplings.” Respondents gave one of the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.

survey_results <- factor(
  c("Disagree", "Neutral", "Strongly Disagree",
  "Neutral", "Agree", "Strongly Agree",
  "Disagree", "Strongly Agree", "Neutral",
  "Strongly Disagree", "Neutral", "Agree"),
  levels = c("Strongly Disagree", "Disagree",
  "Neutral", "Agree", "Strongly Agree"),
  ordered = TRUE)

survey_results

##  [1] Disagree          Neutral           Strongly Disagree Neutral          
##  [5] Agree             Strongly Agree    Disagree          Strongly Agree   
##  [9] Neutral           Strongly Disagree Neutral           Agree            
## 5 Levels: Strongly Disagree < Disagree < Neutral < ... < Strongly Agree
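Once a factor is ordered, comparisons follow the level order rather than alphabetical order. Here is a minimal, self-contained sketch using a hypothetical sizes factor (not part of the survey data above).

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# Comparison uses the declared level order: small < medium < large.
below_large <- sizes < "large"
below_large
## [1]  TRUE FALSE  TRUE  TRUE

# max() is defined for ordered factors and returns the highest level present.
as.character(max(sizes))
## [1] "large"

# table() counts observations per level, in level order.
table(sizes)
```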

using factors

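Ordered factors are useful in regressions that include ordinal categorical variables. In the example below, category is an ordered factor, so lm() uses polynomial contrasts by default, which is why the coefficients are labeled .L (linear), .Q (quadratic), .C (cubic), and so on.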

library(dplyr)
storms <- storms
class(storms$category)

## [1] "ordered" "factor"

levels(storms$category)

## [1] "-1" "0"  "1"  "2"  "3"  "4"  "5"

summary(lm(wind ~ category, data = storms))

## 
## Call:
## lm(formula = wind ~ category, data = storms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2692  -5.8004  -0.8004   4.0890  14.9265 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  86.3832     0.1399 617.486  < 2e-16 ***
## category.L  101.7947     0.4737 214.875  < 2e-16 ***
## category.Q   -2.4750     0.4622  -5.355 8.76e-08 ***
## category.C    3.3975     0.3858   8.807  < 2e-16 ***
## category^4    4.6516     0.3063  15.188  < 2e-16 ***
## category^5   -1.8054     0.2706  -6.672 2.66e-11 ***
## category^6    0.4220     0.2608   1.618    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.437 on 10003 degrees of freedom
## Multiple R-squared:  0.9397, Adjusted R-squared:  0.9397 
## F-statistic: 2.6e+04 on 6 and 10003 DF,  p-value: < 2.2e-16

Factors are also useful in graphs, where the order of the levels controls the order in which the categories are displayed.

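library(ggplot2)

# dtset is not defined in this post; the code assumes a data frame
# with the columns Reason, Total, and Level.
ggplot(dtset, aes(x = reorder(Reason, Total), y = Total, fill = factor(Level, levels = c("High","Medium","Low")))) + 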
  geom_bar(stat = "identity", alpha = 0.75) + 
  scale_fill_manual(values = c("#765285", "#709FB0", "#D1A827"), name="Level of\nFrequency") +
  coord_flip() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12, margin = margin(0,3,0,0)),
        axis.title.y = element_blank(),
        axis.title.x = element_text(size = 12, margin = margin(15,0,0,0)),
        axis.ticks.x = element_line(size = 0),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        plot.margin = unit(c(0,0,1,0), "cm"))

converting factors

A factor can be converted to character using as.character().

f <- factor(c("chocolate", "vanilla", "strawberry"))
f2 <- as.character(f)
class(f2)

## [1] "character"

A factor can be converted to numeric using as.numeric(as.character()).

category <- storms$category
category2 <- as.numeric(as.character(category))
class(category2)

## [1] "numeric"

Or by converting the levels once and indexing them with the factor, as.numeric(levels(f))[f], which is slightly more efficient.

category3 <- as.numeric(levels(category))[category]
class(category3)

## [1] "numeric"
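The detour through as.character() or levels() matters because calling as.numeric() directly on a factor returns the internal integer codes, not the original values. A small sketch with a made-up factor f_num:

```r
f_num <- factor(c("10", "5", "20", "5"))

# Levels are sorted as strings.
levels(f_num)
## [1] "10" "20" "5"

# Direct conversion gives the position of each value among the levels.
codes <- as.numeric(f_num)
codes
## [1] 1 3 2 3

# Converting via character recovers the original numbers.
values <- as.numeric(as.character(f_num))
values
## [1] 10  5 20  5

# The levels-based form gives the same result.
identical(values, as.numeric(levels(f_num))[f_num])
## [1] TRUE
```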

Matrices and Arrays

An array is a multidimensional vector. A matrix is a special type of array that has two dimensions.

matrices

A matrix is an extension of a vector to two dimensions. Just to show what that means:

a <- 1:6
dim(a) #initially NULL

## NULL

dim(a) <- c(2,3)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

In practice, we use the matrix() function to generate a new matrix and specify the numbers of rows and columns.

matrix(data = 1:6, nrow = 2, ncol = 3)

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Note: A matrix stores data of a single type.
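To illustrate the single-type rule, the sketch below mixes numbers, a string and a logical value in one matrix. The name mixed is made up for this example. R coerces everything to the most general type, here character.

```r
# Mixed inputs are coerced to a common type before the matrix is built.
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
mixed
##      [,1] [,2]  
## [1,] "1"  "TRUE"
## [2,] "a"  "2"   

typeof(mixed)
## [1] "character"
```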

By default, the data are filled column by column. Set byrow = TRUE to fill row by row instead.

matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

We can give the row and column names by specifying the dimnames argument.

matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("r1","r2"), c("c1","c2","c3")))

##    c1 c2 c3
## r1  1  2  3
## r2  4  5  6

We refer to part of a matrix using the indexing operator [] that we’ve seen before.

a[2,2] #second row and second column

## [1] 4

a[1:2,1:2] #first two rows and first two columns

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

a[1,] #first row only

## [1] 1 3 5

a[,1] #first column only

## [1] 1 2

cbind() and rbind() combine matrices together by binding columns and rows.

m1 <- matrix(1:9, ncol = 3, nrow = 3) 
m2 <- matrix(10:12, ncol =1, nrow = 3)
m3 <- matrix(10:12, ncol = 3, nrow = 1)

m1

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

m2

##      [,1]
## [1,]   10
## [2,]   11
## [3,]   12

m3

##      [,1] [,2] [,3]
## [1,]   10   11   12

cbind(m1, m2)

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

rbind(m1, m3)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]   10   11   12

matrix operations

R offers rich matrix operators.

For instance, matrix addition:

A <- matrix(c(1:12), 3, 4) 
B <- matrix(c(13:24), 3, 4) 
A + B

##      [,1] [,2] [,3] [,4]
## [1,]   14   20   26   32
## [2,]   16   22   28   34
## [3,]   18   24   30   36

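Element-wise multiplication (the * operator multiplies corresponding entries; the true matrix product uses the %*% operator and requires conformable dimensions):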

A * B

##      [,1] [,2] [,3] [,4]
## [1,]   13   64  133  220
## [2,]   28   85  160  253
## [3,]   45  108  189  288
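Note that A * B multiplies the entries element by element. For the matrix product in the linear-algebra sense, R provides the %*% operator. Since A and B are both 3 x 4, they cannot be multiplied directly, so this sketch multiplies A by the transpose of B. The name P is chosen for this example.

```r
A <- matrix(1:12, 3, 4)
B <- matrix(13:24, 3, 4)

# (3 x 4) %*% (4 x 3) gives a 3 x 3 matrix.
P <- A %*% t(B)
dim(P)
## [1] 3 3

# Each entry is the sum of products of a row of A and a row of B.
P[1, 1]
## [1] 430
sum(A[1, ] * B[1, ])
## [1] 430
```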

Transposition:

t(A)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12

arrays

A matrix is a special, two-dimensional array. An array is a multidimensional vector. Vectors and arrays are stored the same way internally.

b <- 1:12
dim(b) <- c(2,3,2)
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

A more natural way to create an array is to use the function array().

b <- array(1:12, dim = c(2,3,2))
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

attributes(b)

## $dim
## [1] 2 3 2

dim(b)

## [1] 2 3 2
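Indexing an array works like indexing a matrix, with one index per dimension. A short sketch (the array is named arr here to avoid overwriting b):

```r
arr <- array(1:12, dim = c(2, 3, 2))

# Row 1, column 2 of the second slice.
arr[1, 2, 2]
## [1] 9

# Leaving an index empty selects everything along that dimension,
# so this is the whole first slice, a 2 x 3 matrix.
first_slice <- arr[, , 1]
dim(first_slice)
## [1] 2 3
```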

Lists

A list is a vector where each element can be of a different data type.

list() creates a list. List components can be named.

book <- list(title = "Nineteen Eighty-Four: A Novel", 
             author = "George Orwell", 
             published_year = 1949, 
             pages = 328)
book

## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"
## 
## $published_year
## [1] 1949
## 
## $pages
## [1] 328

list indexing

Lists can be indexed by position or name.

By position, the same as in vector indexing.

book[3]

## $published_year
## [1] 1949

book[-3]

## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"
## 
## $pages
## [1] 328

book[[3]]

## [1] 1949

book[c(2,3)]

## $author
## [1] "George Orwell"
## 
## $published_year
## [1] 1949

By name, using $ or [[""]]. With $, R accepts partial matching of element names.

book$title

## [1] "Nineteen Eighty-Four: A Novel"

book$t

## [1] "Nineteen Eighty-Four: A Novel"

book[["title"]]

## [1] "Nineteen Eighty-Four: A Novel"

book[c("title", "author")]

## $title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## $author
## [1] "George Orwell"

Note: With [], the result of these indexing operations is another list. If we want to access the contents of the list, we should use the double brackets [[]] operator or the dollar sign $ operator for the named components.
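The difference is easy to see by checking the class of each result. This sketch uses a small hypothetical list called novel rather than the book list above.

```r
novel <- list(title = "Nineteen Eighty-Four: A Novel", pages = 328)

# Single brackets keep the list wrapper.
class(novel["pages"])
## [1] "list"

# Double brackets (or $) return the element itself.
class(novel[["pages"]])
## [1] "numeric"

# Arithmetic works on the extracted value, not on the sub-list.
novel[["pages"]] + 1
## [1] 329
```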

a list can contain other lists

A list can contain other lists. This makes the list a recursive object in R.

books <- list("this list references another list", book)
books

## [[1]]
## [1] "this list references another list"
## 
## [[2]]
## [[2]]$title
## [1] "Nineteen Eighty-Four: A Novel"
## 
## [[2]]$author
## [1] "George Orwell"
## 
## [[2]]$published_year
## [1] 1949
## 
## [[2]]$pages
## [1] 328

To access nested elements, we can stack up the square brackets.

books[[2]][["pages"]]

## [1] 328

unlist()

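unlist() can be used to flatten a list to a vector. Because a vector holds a single type, the numeric elements are coerced to character in the example below.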

unlist(books)

##                                                                   title 
## "this list references another list"     "Nineteen Eighty-Four: A Novel" 
##                              author                      published_year 
##                     "George Orwell"                              "1949" 
##                               pages 
##                               "328"

Data frames

A data frame is a list with class data.frame. Data frames are used to store spreadsheet-like data with rows and columns. Each column can store data of a different type, but all columns must have the same length. The columns must have names. The components of a data frame can be vectors, factors, numeric matrices, lists, or other data frames.

Data frames are particularly good for representing observational data.

Data frames can be created by data.frame().

laureate <- c("Bob Dylan", "Mo Yan", "Ernest Hemingway", "Winston Churchill", "Bertrand Russell")
year <- c(2016, 2012, 1954, 1953, 1950)
country <- c("United States", "China", "United States", "United Kingdom", "United Kingdom")
genre <- c("poetry, songwriting", "novel, short story", "novel, short story, screenplay", "history, essay, memoirs", "philosophy")

nobel_prize_literature <- data.frame(laureate, year, country, genre)
nobel_prize_literature

##            laureate year        country                          genre
## 1         Bob Dylan 2016  United States            poetry, songwriting
## 2            Mo Yan 2012          China             novel, short story
## 3  Ernest Hemingway 1954  United States novel, short story, screenplay
## 4 Winston Churchill 1953 United Kingdom        history, essay, memoirs
## 5  Bertrand Russell 1950 United Kingdom                     philosophy

Note: A data frame is not a matrix; it is a list interpreted as a data frame.

mode(nobel_prize_literature)

## [1] "list"

class(nobel_prize_literature)

## [1] "data.frame"
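Because a data frame is a list of columns, the usual list functions apply to it. A minimal sketch with a made-up data frame df_small (stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version):

```r
df_small <- data.frame(x = 1:3, y = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

is.list(df_small)     # a data frame is a list ...
## [1] TRUE
is.matrix(df_small)   # ... and not a matrix
## [1] FALSE

# length() counts the list components, that is, the columns.
length(df_small)
## [1] 2

# Each column keeps its own type.
col_classes <- sapply(df_small, class)
col_classes
##           x           y 
##   "integer" "character"
```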

data frame indexing

We can refer to the components of a data frame by name using the list operators $ or [[]].

nobel_prize_literature$laureate

## [1] "Bob Dylan"         "Mo Yan"            "Ernest Hemingway" 
## [4] "Winston Churchill" "Bertrand Russell"

nobel_prize_literature[["laureate"]]

## [1] "Bob Dylan"         "Mo Yan"            "Ernest Hemingway" 
## [4] "Winston Churchill" "Bertrand Russell"

Or using matrix-like notations.

nobel_prize_literature[1,]

##    laureate year       country               genre
## 1 Bob Dylan 2016 United States poetry, songwriting

Logical conditions are allowed, and actually frequently used.

nobel_prize_literature$laureate[nobel_prize_literature$country == "United Kingdom"]

## [1] "Winston Churchill" "Bertrand Russell"
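Logical conditions can also be combined and used to select whole rows or a subset of columns. The sketch below rebuilds a three-row version of the table under the hypothetical name prizes so that it runs on its own.

```r
prizes <- data.frame(
  laureate = c("Bob Dylan", "Mo Yan", "Ernest Hemingway"),
  year     = c(2016, 2012, 1954),
  country  = c("United States", "China", "United States"),
  stringsAsFactors = FALSE
)

# A condition produces a logical vector with one entry per row.
is_us <- prizes$country == "United States"
is_us
## [1]  TRUE FALSE  TRUE

# Combine conditions with & and select whole rows.
recent_us <- prizes[is_us & prizes$year > 2000, ]
recent_us$laureate
## [1] "Bob Dylan"

# Select rows and a subset of columns at the same time.
prizes[is_us, c("laureate", "year")]
```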

data frame attributes

Data frame attributes and their accessor functions include row names (rownames()), column names (colnames()), dimension names (dimnames()), number of rows (nrow()), number of columns (ncol()), and dimensions (dim()).

rownames(nobel_prize_literature)

## [1] "1" "2" "3" "4" "5"

colnames(nobel_prize_literature)

## [1] "laureate" "year"     "country"  "genre"

dimnames(nobel_prize_literature)

## [[1]]
## [1] "1" "2" "3" "4" "5"
## 
## [[2]]
## [1] "laureate" "year"     "country"  "genre"

nrow(nobel_prize_literature)

## [1] 5

ncol(nobel_prize_literature)

## [1] 4

dim(nobel_prize_literature)

## [1] 5 4