More on Subsetting

In Vectors and Data Structures, we discussed how to index/subset vectors, matrices, arrays, lists and data frames. In this post, we will explore more subsetting techniques with a few cases.

Storing temporary outputs

Extending object elements

Objects can be extended by subsetting operators.

v <- c(1,2,3) 
v[4] <- 4 
v

## [1] 1 2 3 4

This technique is useful when we need to create an object to store temporary outputs from a series of operations (e.g. in a loop).

Building on this, we can start with an empty vector and add one element to it in each iteration of a loop.

output <- c()

for (i in 1:5){
  output[i] <- i
}

output

## [1] 1 2 3 4 5
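Growing a vector inside a loop is convenient, but when the final length is known in advance, a common alternative is to create the vector at full length first and then fill it in. The sketch below is only an illustration; the names n and output_prealloc are not part of the original example. It also shows what happens when we assign past the end of a vector and skip positions: R fills the gap with NA.

```r
n <- 5

# Create a numeric vector of the final length (filled with zeros)
output_prealloc <- numeric(n)

# Fill the vector position by position
for (i in seq_len(n)) {
  output_prealloc[i] <- i
}

output_prealloc
## [1] 1 2 3 4 5

# Assigning to position 6 of a length-3 vector leaves positions 4 and 5 as NA
v <- c(1, 2, 3)
v[6] <- 6
v
## [1]  1  2  3 NA NA  6
```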

The same idea works for data frames: starting from an empty data frame, we can fill in its rows and columns one cell at a time inside a loop.

multiplier <- 1:5
output <- data.frame()

## loop over the positions of multiplier rather than its values
for (i in seq_along(multiplier)){
  output[i,1] <- multiplier[i]
  output[i,2] <- paste(2, "*", multiplier[i], "=", 2 * multiplier[i])
}

names(output) <- c("multiplier", "output")
output

##   multiplier     output
## 1          1  2 * 1 = 2
## 2          2  2 * 2 = 4
## 3          3  2 * 3 = 6
## 4          4  2 * 4 = 8
## 5          5 2 * 5 = 10
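As an alternative to filling a data frame cell by cell, we can collect one small data frame per iteration in a list and bind them together once at the end. This is a minimal sketch of that pattern, not a replacement for the example above; the names rows and output_bound are made up for illustration.

```r
multiplier <- 1:5

# A list with one (initially empty) slot per iteration
rows <- vector("list", length(multiplier))

for (i in seq_along(multiplier)) {
  # Each iteration produces a one-row data frame
  rows[[i]] <- data.frame(
    multiplier = multiplier[i],
    output = paste(2, "*", multiplier[i], "=", 2 * multiplier[i]),
    stringsAsFactors = FALSE
  )
}

# Bind all one-row data frames together in a single step
output_bound <- do.call(rbind, rows)
```

With this approach the column names are set when each row is created, so no separate names() call is needed.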

Case 1

Now we will look at how storing temporary outputs by extending object elements could be useful.

In the example below, we use the dataset df, in which each row describes a talk. We want to add a new column num_tweet to df that holds, for each row, the number of tweets mentioning the speaker(s) listed in main_speaker. To do that, we first need to create a new vector to store the number of tweets for each speaker.

head(df["main_speaker"])

##                        main_speaker
## 1 Aicha el-Wafi + Phyllis Rodriguez
## 2                         AJ Jacobs
## 3                    Markus Fischer
## 4                 Improv Everywhere
## 5                     Geert Chatrou
## 6                     Aakash Odedra

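In each iteration of the loop, a query with the speaker name is sent to Twitter to search for tweets mentioning the speaker. The number of tweets returned for each speaker is then stored in the vector num_tweets. Note that search_tweets() returns at most n tweets per query (100 by default), so the counts are capped unless n is raised.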

library(rtweet)

## creating the query term: co-speakers are joined with OR
df$speaker <- gsub(pattern = " and |,| \\+ ", replacement = " OR ", df$main_speaker)
speaker <- df$speaker

## creating a new vector to store the number of tweets for each speaker
num_tweets <- double()

for (i in seq_along(speaker)){
  tweets <- nrow(search_tweets(q = speaker[i], retryonratelimit = TRUE))
  num_tweets[i] <- tweets
}

df$num_tweet <- num_tweets
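The loop above needs Twitter access to run. To show the query construction and the storing pattern on their own, here is a small sketch that works offline. The function count_mentions is a hypothetical stand-in for the Twitter request (it simply counts the characters in the query), and the two speaker names are copied from the output shown earlier.

```r
# Two speaker names taken from the head(df["main_speaker"]) output
main_speaker <- c("Aicha el-Wafi + Phyllis Rodriguez", "AJ Jacobs")

# Replace " and ", "," and " + " with " OR " so a query matches any co-speaker
query <- gsub(pattern = " and |,| \\+ ", replacement = " OR ", main_speaker)

# Hypothetical stand-in for the Twitter request: returns the number of
# characters in the query instead of a real tweet count
count_mentions <- function(q) nchar(q)

# vapply() returns an integer vector with one value per query
num_tweets_demo <- vapply(query, count_mentions, integer(1), USE.NAMES = FALSE)
```

Here query[1] becomes "Aicha el-Wafi OR Phyllis Rodriguez", while "AJ Jacobs" is left unchanged because it contains none of the separators.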

Case 2

Using the same data, below we want to extract the polarity scores from the ratings for each talk, weight each polarity score by how many people gave the rating, and sum up all the weighted polarity scores of each talk to get a final “sentiment score”.

The ratings column stores each talk's ratings as a string that resembles a list of Python dictionaries.

head(df["ratings"], 1)

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ratings
## 1 [{'id': 10, 'name': 'Inspiring', 'count': 385}, {'id': 1, 'name': 'Beautiful', 'count': 229}, {'id': 9, 'name': 'Ingenious', 'count': 10}, {'id': 3, 'name': 'Courageous', 'count': 338}, {'id': 24, 'name': 'Persuasive', 'count': 30}, {'id': 2, 'name': 'Confusing', 'count': 8}, {'id': 23, 'name': 'Jaw-dropping', 'count': 37}, {'id': 22, 'name': 'Fascinating', 'count': 26}, {'id': 8, 'name': 'Informative', 'count': 24}, {'id': 26, 'name': 'Obnoxious', 'count': 13}, {'id': 11, 'name': 'Longwinded', 'count': 7}, {'id': 21, 'name': 'Unconvincing', 'count': 25}, {'id': 25, 'name': 'OK', 'count': 20}, {'id': 7, 'name': 'Funny', 'count': 2}]

We first need to convert the ratings to a list, where each element is a data frame.

library(purrr)
library(jsonlite)
library(sentimentr)

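## replace single quotes with double quotes so that each string is valid JSON
rating <- gsub("'", '"', df$ratings)
## parse each JSON string into a data frame with columns id, name and count
rating2 <- map(rating, fromJSON)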

rating2[[2]]

##    id         name count
## 1  22  Fascinating   531
## 2   3   Courageous   345
## 3   8  Informative   376
## 4   7        Funny   936
## 5  25           OK   219
## 6  24   Persuasive   130
## 7  10    Inspiring   422
## 8   9    Ingenious   250
## 9  21 Unconvincing   150
## 10 26    Obnoxious   116
## 11 11   Longwinded   105
## 12  1    Beautiful    50
## 13  2    Confusing    52
## 14 23 Jaw-dropping    45
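To see the conversion step in isolation, the sketch below parses a shortened ratings string written in the same format as df$ratings. It assumes the jsonlite package is installed; the names raw, json and parsed are illustrative only.

```r
library(jsonlite)

# A shortened ratings string in the same format as the ratings column
raw <- "[{'id': 10, 'name': 'Inspiring', 'count': 385}, {'id': 7, 'name': 'Funny', 'count': 2}]"

# JSON requires double quotes around strings, so swap the single quotes
json <- gsub("'", '"', raw)

# fromJSON() simplifies an array of objects with the same keys into a data frame
parsed <- fromJSON(json)

# One row per rating, with columns id, name and count
parsed$name
parsed$count
```

One caveat of this quote-swapping approach: it would produce invalid JSON if a rating name itself contained an apostrophe. None of the names shown above do.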

For each data frame, we run the sentiment analysis on each rating label (each row of the name column in the data frames in rating2) and extract its polarity score. The scores are stored in a list, where each element is a vector.

## calculate the polarity score
rating3 <- list()
for (i in seq_along(rating2)){
  rating3[[i]] <- sentiment(rating2[[i]][,2])$sentiment
}

head(rating3, 3)

## [[1]]
##  [1]  0.7500000  0.7500000  0.7500000  1.0000000  0.0000000 -0.5000000
##  [7]  0.7071068  0.7500000  0.0000000 -0.7500000 -0.4000000 -0.5000000
## [13]  0.0000000  0.8000000
## 
## [[2]]
##  [1]  0.7500000  1.0000000  0.0000000  0.8000000  0.0000000  0.0000000
##  [7]  0.7500000  0.7500000 -0.5000000 -0.7500000 -0.4000000  0.7500000
## [13] -0.5000000  0.7071068
## 
## [[3]]
##  [1]  0.7071068  0.7500000  0.7500000  0.0000000  0.7500000  0.0000000
##  [7]  0.0000000  0.7500000 -0.5000000 -0.5000000 -0.4000000 -0.7500000
## [13]  1.0000000  0.8000000

We then weight the polarity score of each rating by its count, which is the third column of each data frame in the list rating2. Multiplying the count vector by the score vector with %*% gives their inner product, that is, the sum of the weighted scores, which is the final score for each talk. The final scores are stored in a list.

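## weight the score by counts
score <- list()
for (i in seq_along(rating2)){
  count <- rating2[[i]][,3]
  ## inner product of counts and polarity scores (a 1 x 1 matrix);
  ## named weighted_sum to avoid masking the base function sum()
  weighted_sum <- count %*% rating3[[i]]
  score[[i]] <- weighted_sum[1,1]
}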

head(score)

## [[1]]
## [1] 824.213
## 
## [[2]]
## [1] 1835.37
## 
## [[3]]
## [1] 4071.868
## 
## [[4]]
## [1] 1335.326
## 
## [[5]]
## [1] 698.7851
## 
## [[6]]
## [1] 484.3053

score is then unlisted into a vector and added to df as a new column.

score <- unlist(score)
df$score <- score
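The weighting step can also be written without an explicit loop. The sketch below uses mapply() to walk through two lists in parallel and returns a plain numeric vector, so no unlist() is needed. The data are made-up stand-ins for two elements of rating2 and rating3, and the names ratings_demo, polarity_demo and score_demo are illustrative.

```r
# Toy stand-ins for two elements of rating2 (counts are made up)
ratings_demo <- list(
  data.frame(name = c("Inspiring", "Confusing"), count = c(10, 4)),
  data.frame(name = c("Funny", "Obnoxious"), count = c(3, 2))
)

# Toy stand-ins for the matching polarity scores in rating3
polarity_demo <- list(c(0.75, -0.5), c(0.8, -0.75))

# For each talk, multiply counts by polarity scores element-wise and sum
score_demo <- mapply(
  function(r, s) sum(r$count * s),
  ratings_demo, polarity_demo
)

score_demo
## [1] 5.5 0.9
```

For the first toy talk, the score is 10 * 0.75 + 4 * (-0.5) = 5.5, which is the same value the inner product with %*% would give.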

Modifying object elements

The help file of the R sample dataset USArrests discusses which subsets of the data need to be modified. These are useful examples of modifying data frame elements.

We can access the examples by typing ?USArrests in the R console.

In the first case, the urban population UrbanPop of Maryland needs to be modified to 76.6.

d1 <- USArrests
d1["Maryland", "UrbanPop"] <- 76.6

In the second case, the urban population of several states needs to be adjusted by 0.5 to account for how the values were rounded previously.

states1 <- c("Colorado", "Florida", "Mississippi", "Wyoming")
states2 <- c("Nebraska", "Pennsylvania")

d1[states1, "UrbanPop"] <- d1[states1, "UrbanPop"] + 0.5
d1[states2, "UrbanPop"] <- d1[states2, "UrbanPop"] - 0.5
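After modifying elements in place, it can be reassuring to check which rows actually changed. The sketch below repeats the assignments above on a fresh copy and compares the result with the original USArrests; the name changed is illustrative. Note that the rows are selected by row name, which works because USArrests uses state names as row names.

```r
d1 <- USArrests
d1["Maryland", "UrbanPop"] <- 76.6

states1 <- c("Colorado", "Florida", "Mississippi", "Wyoming")
states2 <- c("Nebraska", "Pennsylvania")
d1[states1, "UrbanPop"] <- d1[states1, "UrbanPop"] + 0.5
d1[states2, "UrbanPop"] <- d1[states2, "UrbanPop"] - 0.5

# Row names of the states whose UrbanPop differs from the original data
changed <- rownames(d1)[d1$UrbanPop != USArrests$UrbanPop]
changed

# Original and modified values side by side for the changed rows
cbind(original = USArrests[changed, "UrbanPop"],
      modified = d1[changed, "UrbanPop"])
```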

Below we have another example. Using the data sample1, we want to replace the values of two columns depending on the value of the third column.

head(sample1[c("var", "var_avg", "var_p50")], 10)

##      var var_avg var_p50
## 1  -0.41   -0.41   -0.41
## 2     NA      NA      NA
## 3     NA      NA      NA
## 4     NA      NA      NA
## 5     NA      NA      NA
## 6     NA      NA      NA
## 7  -0.01   -0.01   -0.01
## 8     NA      NA      NA
## 9     NA      NA      NA
## 10    NA      NA      NA

We first find the rows where var is missing, and then replace var_avg and var_p50 with 0 in those rows. A single replacement value is recycled across all selected cells.

sample1[is.na(sample1$var), c("var_avg", "var_p50")] <- 0

This is the result.

head(sample1[c("var", "var_avg", "var_p50")], 10)

##      var var_avg var_p50
## 1  -0.41   -0.41   -0.41
## 2     NA    0.00    0.00
## 3     NA    0.00    0.00
## 4     NA    0.00    0.00
## 5     NA    0.00    0.00
## 6     NA    0.00    0.00
## 7  -0.01   -0.01   -0.01
## 8     NA    0.00    0.00
## 9     NA    0.00    0.00
## 10    NA    0.00    0.00
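Since sample1 is not shipped with R, here is a self-contained sketch of the same conditional replacement on a toy data frame. The data frame toy and the index missing_var are made up for illustration; only the column names follow the example above.

```r
# Toy data with the same column names as sample1 (values are illustrative)
toy <- data.frame(
  var     = c(-0.41, NA, -0.01),
  var_avg = c(-0.41, NA, -0.01),
  var_p50 = c(-0.41, NA, -0.01)
)

# Logical index of the rows where var is missing
missing_var <- is.na(toy$var)

# The rows come from the logical index, the columns from the name vector;
# the single value 0 is recycled across all selected cells
toy[missing_var, c("var_avg", "var_p50")] <- 0

toy
##     var var_avg var_p50
## 1 -0.41   -0.41   -0.41
## 2    NA    0.00    0.00
## 3 -0.01   -0.01   -0.01
```

The column var itself is left untouched, so the missing values remain visible there.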

`subset()`

In addition to the subsetting operators, we can use the function subset() to return the elements of a vector, or the rows of a matrix or data frame, that meet a condition.

Below we get a subset of iris: the Sepal.Length values of the versicolor flowers whose petal width is below 1.2.

subset(iris, Species == "versicolor" & Petal.Width < 1.2, Sepal.Length)

##    Sepal.Length
## 58          4.9
## 61          5.0
## 63          6.0
## 68          5.8
## 70          5.6
## 80          5.7
## 81          5.5
## 82          5.5
## 94          5.0
## 99          5.1
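For comparison, the same selection can be written with the subsetting operator. The sketch below builds both versions so they can be compared; the names keep, by_bracket and by_subset are illustrative. With the bracket form we have to prefix the columns with iris$ and set drop = FALSE to keep a one-column data frame.

```r
# Logical index: TRUE for versicolor rows with petal width below 1.2
keep <- iris$Species == "versicolor" & iris$Petal.Width < 1.2

# Subsetting operator: rows by logical index, column by name
by_bracket <- iris[keep, "Sepal.Length", drop = FALSE]

# subset(): the condition and the column are evaluated inside iris
by_subset <- subset(iris, Species == "versicolor" & Petal.Width < 1.2, Sepal.Length)

# Both approaches select the same 10 rows
nrow(by_bracket)
## [1] 10
all.equal(by_bracket, by_subset)
```

One difference to keep in mind: subset() drops rows where the condition evaluates to NA, whereas indexing with a logical vector that contains NA returns rows filled with NA. The iris data has no missing values, so the two forms are expected to agree here.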

%in% operator

%in% is an operator that returns a logical vector indicating, for each element of its left operand, whether it has a match in the right operand. It is useful when we want to keep or drop elements of a vector based on a set of values.

Below we remove values a, b, and c from the vector y.

y <- letters
## letters is the 26 lower-case letters of the Roman alphabet built into R

y <- y[!(y %in% c("a", "b", "c"))]
y

##  [1] "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
## [20] "w" "x" "y" "z"
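To make the intermediate logical vector visible, the short sketch below evaluates %in% on its own before using it as an index. The vectors hits, x and kept are made up for illustration.

```r
# %in% returns one TRUE/FALSE per element of the left operand;
# matching is case-sensitive, so "B" is not found in letters
hits <- c("a", "z", "B") %in% letters
hits
## [1]  TRUE  TRUE FALSE

# Negating the result and using it as a logical index drops the matches
x <- c("a", "d", "b", "e")
kept <- x[!(x %in% c("a", "b", "c"))]
kept
## [1] "d" "e"
```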

In another example below, we remove the rows whose id belongs to an admin account from data to create a student subset.

library(dplyr)

admin <- c("tst282", "tst288", "tst424", "tst284")
student <- filter(data, !(id %in% admin))
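The data frame used above is not shown in this post, so here is a self-contained sketch of the same filtering step in base R. The data frame accounts, its ids and its scores are made up for illustration; only the admin vector is copied from the example.

```r
# Toy data frame; the non-admin ids and the scores are made up
accounts <- data.frame(
  id    = c("tst282", "abc001", "tst424", "abc002"),
  score = c(NA, 88, NA, 92),
  stringsAsFactors = FALSE
)

admin <- c("tst282", "tst288", "tst424", "tst284")

# Keep only the rows whose id is not in the admin list
student_demo <- accounts[!(accounts$id %in% admin), ]

student_demo$id
## [1] "abc001" "abc002"
```

It does not matter that admin contains ids that never appear in accounts: %in% only checks each id on the left against the set on the right.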