In Vectors and Data Structures, we discussed how to index/subset vectors, matrices, arrays, lists and data frames. In this post, we will explore more subsetting techniques with a few cases.
Subsetting operators can also extend an object: assigning to an index just past its current end adds a new element.
v <- c(1,2,3)
v[4] <- 4
v
## [1] 1 2 3 4
This technique is useful when we need to create an object to store temporary outputs from a series of operations (e.g. in a loop).
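As a small sketch of what happens when the assigned index skips past the end of the vector, R pads the gap with NA. Assigning to a new name grows a list in the same way. The object names below are illustrative only.

```r
v <- c(1, 2, 3)
v[6] <- 6            # positions 4 and 5 do not exist yet
v
## [1]  1  2  3 NA NA  6

l <- list(a = 1)
l[["b"]] <- "two"    # assigning to a new name appends an element
names(l)
## [1] "a" "b"
```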
Building on the example above, we can fill an empty vector one element at a time inside a loop.
output <- c()
for (i in 1:5){
  output[i] <- i
}
output
## [1] 1 2 3 4 5
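Growing a vector one element at a time is convenient. When the final length is known in advance, a common alternative is to create the vector at full length first and then fill it in. The sketch below assumes the same five iterations; the names n, output2 and x are introduced here for illustration.

```r
n <- 5
output2 <- numeric(n)          # five zeros, overwritten in the loop
for (i in seq_along(output2)){
  output2[i] <- i
}
output2
## [1] 1 2 3 4 5

## seq_along() is safer than 1:length() when a vector may be empty
x <- c()
1:length(x)      # counts down from 1 to 0, so a loop would run twice
## [1] 1 0
seq_along(x)     # empty, so a loop would not run at all
## integer(0)
```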
The same approach fills the rows and columns of an empty data frame inside a loop.
multiplier <- 1:5
output <- data.frame()
## loop over positions rather than values, so the index stays valid for any multiplier
for (i in seq_along(multiplier)){
  output[i,1] <- multiplier[i]
  output[i,2] <- paste(2, "*", multiplier[i], "=", 2 * multiplier[i])
}
names(output) <- c("multiplier", "output")
output
##   multiplier     output
## 1          1  2 * 1 = 2
## 2          2  2 * 2 = 4
## 3          3  2 * 3 = 6
## 4          4  2 * 4 = 8
## 5          5 2 * 5 = 10
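Because paste() and arithmetic are vectorised, the same table can also be built without a loop. This is a sketch of that alternative, not a replacement for the loop above; output_df is a name introduced here.

```r
multiplier <- 1:5
output_df <- data.frame(
  multiplier = multiplier,
  output = paste(2, "*", multiplier, "=", 2 * multiplier),
  stringsAsFactors = FALSE    # keep output as character in older R versions too
)
nrow(output_df)
## [1] 5
output_df$output[5]
## [1] "2 * 5 = 10"
```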
Next, we look at two cases where storing intermediate results by extending an object is useful.
In the example below, we use the dataset df. Each row of df describes a talk. We want to add a new column, num_tweet, to df that holds the number of tweets mentioning the speaker named in that row's main_speaker. Before that, we need to create a new vector to store the number of tweets for each speaker.
head(df["main_speaker"])
## main_speaker
## 1 Aicha el-Wafi + Phyllis Rodriguez
## 2 AJ Jacobs
## 3 Markus Fischer
## 4 Improv Everywhere
## 5 Geert Chatrou
## 6 Aakash Odedra
In each iteration of the loop, a query with the speaker name is sent to Twitter to collect tweets mentioning the speaker. The number of tweets returned for each speaker is then stored in the vector num_tweets.
library(rtweet)
## creating the query term: co-speakers joined by " and ", "," or " + " become OR terms
df$speaker <- gsub(pattern = " and |,| \\+ ", replacement = " OR ", df$main_speaker)
## use the cleaned query terms rather than the raw names
speaker <- df$speaker
## creating a new vector to store the number of tweets for each speaker
num_tweets <- double()
for (i in seq_along(speaker)){
  tweets <- nrow(search_tweets(q = speaker[i], retryonratelimit = TRUE))
  num_tweets[i] <- tweets
}
df$num_tweet <- num_tweets
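The query-term step can be checked on its own, without calling Twitter. The sketch below applies the same gsub() pattern to three speaker names taken from the head(df["main_speaker"]) output shown earlier; main_speaker and query are standalone vectors created here for illustration.

```r
main_speaker <- c("Aicha el-Wafi + Phyllis Rodriguez", "AJ Jacobs", "Markus Fischer")
query <- gsub(pattern = " and |,| \\+ ", replacement = " OR ", main_speaker)
query[1]                                      # the " + " separator becomes " OR "
## [1] "Aicha el-Wafi OR Phyllis Rodriguez"
identical(query[2:3], main_speaker[2:3])      # single speakers are left unchanged
## [1] TRUE
```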
Using the same data, we now want to extract a polarity score for each rating of a talk, weight each polarity score by how many people gave that rating, and sum the weighted polarity scores of each talk to get a final “sentiment score”. The ratings column stores each talk's ratings as a string that resembles a list of dictionaries.
head(df["ratings"], 1)
## ratings
## 1 [{'id': 10, 'name': 'Inspiring', 'count': 385}, {'id': 1, 'name': 'Beautiful', 'count': 229}, {'id': 9, 'name': 'Ingenious', 'count': 10}, {'id': 3, 'name': 'Courageous', 'count': 338}, {'id': 24, 'name': 'Persuasive', 'count': 30}, {'id': 2, 'name': 'Confusing', 'count': 8}, {'id': 23, 'name': 'Jaw-dropping', 'count': 37}, {'id': 22, 'name': 'Fascinating', 'count': 26}, {'id': 8, 'name': 'Informative', 'count': 24}, {'id': 26, 'name': 'Obnoxious', 'count': 13}, {'id': 11, 'name': 'Longwinded', 'count': 7}, {'id': 21, 'name': 'Unconvincing', 'count': 25}, {'id': 25, 'name': 'OK', 'count': 20}, {'id': 7, 'name': 'Funny', 'count': 2}]
We first need to convert ratings to a list, where each element is a data frame.
library(purrr)
library(jsonlite)
library(sentimentr)
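## replace single quotes with double quotes so that each string is valid JSON
rating <- gsub("'", '"', df$ratings)
## parse each string into a data frame with the columns id, name and count
rating2 <- map(rating, fromJSON)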
rating2[[2]]
## id name count
## 1 22 Fascinating 531
## 2 3 Courageous 345
## 3 8 Informative 376
## 4 7 Funny 936
## 5 25 OK 219
## 6 24 Persuasive 130
## 7 10 Inspiring 422
## 8 9 Ingenious 250
## 9 21 Unconvincing 150
## 10 26 Obnoxious 116
## 11 11 Longwinded 105
## 12 1 Beautiful 50
## 13 2 Confusing 52
## 14 23 Jaw-dropping 45
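To see the conversion on a single value, the sketch below parses a shortened ratings string in the same format as the one printed earlier. The names raw and parsed are introduced here; only the first two ratings of the first talk are used.

```r
library(jsonlite)
raw <- "[{'id': 10, 'name': 'Inspiring', 'count': 385}, {'id': 1, 'name': 'Beautiful', 'count': 229}]"
parsed <- fromJSON(gsub("'", '"', raw))   # swap the quotes, then parse
class(parsed)
## [1] "data.frame"
parsed$name
## [1] "Inspiring" "Beautiful"
parsed$count
## [1] 385 229
```

Note that the quote swap would break on a value that itself contains an apostrophe; none of the rating names shown here do.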
For each data frame, we run the sentiment analysis on every rating (each row of the column name of the data frames in rating2) and extract its polarity score. The scores are stored in a list, where each element is a vector.
## calculate the polarity score
rating3 <- list()
for (i in seq_along(rating2)){
  ## column 2 holds the rating names
  rating3[[i]] <- sentiment(rating2[[i]][,2])$sentiment
}
head(rating3, 3)
## [[1]]
## [1] 0.7500000 0.7500000 0.7500000 1.0000000 0.0000000 -0.5000000
## [7] 0.7071068 0.7500000 0.0000000 -0.7500000 -0.4000000 -0.5000000
## [13] 0.0000000 0.8000000
##
## [[2]]
## [1] 0.7500000 1.0000000 0.0000000 0.8000000 0.0000000 0.0000000
## [7] 0.7500000 0.7500000 -0.5000000 -0.7500000 -0.4000000 0.7500000
## [13] -0.5000000 0.7071068
##
## [[3]]
## [1] 0.7071068 0.7500000 0.7500000 0.0000000 0.7500000 0.0000000
## [7] 0.0000000 0.7500000 -0.5000000 -0.5000000 -0.4000000 -0.7500000
## [13] 1.0000000 0.8000000
We then weight the polarity score of each rating by its count, which is the third column of each data frame in the list rating2. Multiplying the count vector by the score vector with matrix multiplication sums the weighted scores in one step, and the final score of each talk is stored in a list.
## weight the score by counts
score <- list()
for (i in seq_along(rating2)){
  count <- rating2[[i]][,3]
  ## count %*% scores gives a 1 x 1 matrix holding the weighted sum
  weighted <- count %*% rating3[[i]]
  score[[i]] <- weighted[1,1]
}
head(score)
## [[1]]
## [1] 824.213
##
## [[2]]
## [1] 1835.37
##
## [[3]]
## [1] 4071.868
##
## [[4]]
## [1] 1335.326
##
## [[5]]
## [1] 698.7851
##
## [[6]]
## [1] 484.3053
score is then unlisted into a vector and added back to df as a new variable.
score <- unlist(score)
df$score <- score
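The weighting step can be traced with small numbers. The sketch below uses hypothetical counts and polarity scores for two talks (not values from df) to show that the matrix product equals the sum of the element-wise products. It also shows vapply() as one way to get a numeric vector directly, which makes unlist() unnecessary.

```r
## hypothetical counts and polarity scores for two talks
counts <- list(c(10, 5, 2), c(4, 4))
scores <- list(c(0.75, -0.5, 0), c(1, -0.25))

m <- counts[[1]] %*% scores[[1]]
dim(m)                                        # a 1 x 1 matrix
## [1] 1 1
m[1, 1] == sum(counts[[1]] * scores[[1]])
## [1] TRUE

score2 <- vapply(seq_along(counts),
                 function(i) sum(counts[[i]] * scores[[i]]),
                 numeric(1))
score2
## [1] 5 3
```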
The help file of the built-in R dataset USArrests describes corrections that should be made to some of its values. These are useful examples of modifying data frame elements. We can open the help file by typing ?USArrests in the R console.
In the first case, the urban population UrbanPop of Maryland needs to be changed to 76.6.
d1 <- USArrests
d1["Maryland", "UrbanPop"] <- 76.6
In the second case, the urban population of several states needs to be adjusted by 0.5 to undo earlier rounding.
states1 <- c("Colorado", "Florida", "Mississippi", "Wyoming")
states2 <- c("Nebraska", "Pennsylvania")
d1[states1, "UrbanPop"] <- d1[states1, "UrbanPop"] + 0.5
d1[states2, "UrbanPop"] <- d1[states2, "UrbanPop"] - 0.5
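A quick check of these assignments, as a sketch: UrbanPop is stored as whole numbers in USArrests, so assigning 76.6 converts the whole column to decimal values, and the adjusted rows can be compared against the untouched original.

```r
d1 <- USArrests
class(d1$UrbanPop)                     # stored as whole numbers
## [1] "integer"
d1["Maryland", "UrbanPop"] <- 76.6
class(d1$UrbanPop)                     # the column is converted to hold 76.6
## [1] "numeric"

states1 <- c("Colorado", "Florida", "Mississippi", "Wyoming")
d1[states1, "UrbanPop"] <- d1[states1, "UrbanPop"] + 0.5
d1[states1, "UrbanPop"] - USArrests[states1, "UrbanPop"]
## [1] 0.5 0.5 0.5 0.5
```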
Below is another example. Using the data sample1, we want to replace the values of two columns depending on the value of a third column.
head(sample1[c("var", "var_avg", "var_p50")], 10)
## var var_avg var_p50
## 1 -0.41 -0.41 -0.41
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## 7 -0.01 -0.01 -0.01
## 8 NA NA NA
## 9 NA NA NA
## 10 NA NA NA
We first find the rows where var is missing, and then replace var_avg and var_p50 with 0 in those rows.
## a single value is recycled across every selected cell
sample1[is.na(sample1$var), c("var_avg", "var_p50")] <- 0
This is the result.
head(sample1[c("var", "var_avg", "var_p50")], 10)
## var var_avg var_p50
## 1 -0.41 -0.41 -0.41
## 2 NA 0.00 0.00
## 3 NA 0.00 0.00
## 4 NA 0.00 0.00
## 5 NA 0.00 0.00
## 6 NA 0.00 0.00
## 7 -0.01 -0.01 -0.01
## 8 NA 0.00 0.00
## 9 NA 0.00 0.00
## 10 NA 0.00 0.00
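If the two columns needed different replacement values, one cautious option is to assign one column at a time, which avoids relying on how a plain vector is recycled across the selected cells. The sketch below uses toy, a hypothetical stand-in for sample1 with the same three columns, and -1 as an arbitrary second replacement value.

```r
toy <- data.frame(var     = c(-0.41, NA, NA, -0.01),
                  var_avg = c(-0.41, NA, NA, -0.01),
                  var_p50 = c(-0.41, NA, NA, -0.01))
is_missing <- is.na(toy$var)       # TRUE for rows 2 and 3

toy[is_missing, "var_avg"] <- 0
toy[is_missing, "var_p50"] <- -1
toy$var_avg
## [1] -0.41  0.00  0.00 -0.01
toy$var_p50
## [1] -0.41 -1.00 -1.00 -0.01
```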
subset()
In addition to the subsetting operators, we can use the function subset() to return the elements of a vector, or the rows of a matrix or data frame, that meet a condition.
Below we get a subset of iris: the Sepal.Length values of versicolor flowers whose Petal.Width is below 1.2.
subset(iris, Species == "versicolor" & Petal.Width < 1.2, Sepal.Length)
## Sepal.Length
## 58 4.9
## 61 5.0
## 63 6.0
## 68 5.8
## 70 5.6
## 80 5.7
## 81 5.5
## 82 5.5
## 94 5.0
## 99 5.1
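For comparison, the same rows can be selected with the [ operator. This sketch shows the two forms giving the same data frame; by_subset, keep and by_bracket are names introduced here.

```r
by_subset  <- subset(iris, Species == "versicolor" & Petal.Width < 1.2, Sepal.Length)
keep       <- iris$Species == "versicolor" & iris$Petal.Width < 1.2
by_bracket <- iris[keep, "Sepal.Length", drop = FALSE]   # drop = FALSE keeps a data frame
all.equal(by_subset, by_bracket)
## [1] TRUE
nrow(by_bracket)
## [1] 10
```

One difference to keep in mind: subset() drops rows where the condition is NA, whereas a logical index containing NA returns rows of NA.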
%in% is an operator that returns a logical vector indicating, for each element of its left operand, whether it has a match in the right operand. It is useful when we want to keep or drop elements of a vector by value.
Below we remove the values a, b, and c from the vector y.
y <- letters
## letters is the 26 lower-case letters of the Roman alphabet built into R
y <- y[!y %in% c("a", "b", "c")]
y
## [1] "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
## [20] "w" "x" "y" "z"
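To make the intermediate step visible, the sketch below prints the logical vector that %in% returns before it is used as an index. The short vector x is introduced here for illustration.

```r
x <- c("a", "d", "z")
x %in% c("a", "b", "c")          # one TRUE/FALSE per element of x
## [1]  TRUE FALSE FALSE
x[!(x %in% c("a", "b", "c"))]    # parentheses make the negation explicit
## [1] "d" "z"
setdiff(x, c("a", "b", "c"))     # same result here, but setdiff() also drops duplicates
## [1] "d" "z"
```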
In another example below, we remove the accounts listed in admin from data by matching on id, which leaves a student subset.
library(dplyr)
admin <- c("tst282", "tst288", "tst424", "tst284")
student <- filter(data, !(id %in% admin))
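The same filter can be written with base R subsetting. The sketch below assumes a small hypothetical data frame: the non-admin id values and the score column are made up for illustration.

```r
data <- data.frame(id = c("tst282", "tst101", "tst424", "tst102"),
                   score = c(NA, 88, NA, 92),
                   stringsAsFactors = FALSE)
admin <- c("tst282", "tst288", "tst424", "tst284")
student <- data[!(data$id %in% admin), ]   # keep rows whose id is not an admin account
student$id
## [1] "tst101" "tst102"
```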