This post was adapted from a section of The Dirty Work - Reshaping Data for Visualization, where we talked about extracting strings from around separators before plotting. The topic is worth a post of its own.

The data we collected contained separators such as commas, tabs, or spaces, which made cleaning and plotting a bit annoying. A dataset like this needs to be preprocessed before we can make sense of it. We will look at two cases below, each illustrating a different situation.

Before we begin, let’s load all the packages we will need, assuming you have already installed them.

library(stringr)
library(dplyr)

Case I: Strings without separators

We have data from student submissions on “My favorite place to study in Library”. The messy raw data were recoded to be consistent across respondents.

load("~/Desktop/r sample files/survey")
head(survey$space_lib, n = 25)
##  [1] "desk"             "windows"          "428"             
##  [4] "428"              "talking"          "silent"          
##  [7] "windows"          "400"              "silent"          
## [10] "428"              "NULL"             "silent"          
## [13] "silent"           "windows"          "group study room"
## [16] "400"              "group study room" "group study room"
## [19] "400"              "407"              "407, silent"     
## [22] "other"            "other"            "group study room"
## [25] "NULL"

Most responses are single strings, but some contain multiple strings separated by commas.

Let’s say we want to create a frequency table counting how many times each unique value appears.

We are not too worried here, because the strings themselves do not contain the separator, the comma. The program cannot confuse a separating comma with one that is part of a string, so we can safely split the strings on commas.

In the second case below, by contrast, some strings contain the separator themselves, such as the commas in “Print, photocopy, scan”. Simply splitting those strings would be problematic.
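To see the problem concretely, here is a quick sketch (with a made-up response) of what happens when a value that itself contains commas is split naively:

```r
# a made-up response whose second value itself contains commas
x <- "Meet up with friends,Print, photocopy, scan"
unlist(strsplit(x, ","))
## [1] "Meet up with friends" "Print"                " photocopy"          
## [4] " scan"
## "Print, photocopy, scan" is wrongly broken into three pieces
```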

Back to the first case.

lib <- unlist(strsplit(survey$space_lib, ","))
## strsplit() splits the elements of the character vector space_lib into substrings by the commas
## strsplit() is from base R
## the result is a list of character vectors, one per respondent
## unlist() flattens the list into a single long character vector
head(lib, n = 30)
##  [1] "desk"             "windows"          "428"             
##  [4] "428"              "talking"          "silent"          
##  [7] "windows"          "400"              "silent"          
## [10] "428"              "NULL"             "silent"          
## [13] "silent"           "windows"          "group study room"
## [16] "400"              "group study room" "group study room"
## [19] "400"              "407"              "407"             
## [22] " silent"          "other"            "other"           
## [25] "group study room" "NULL"             "silent"          
## [28] "silent"           "silent"           "407"
lib <- str_trim(lib, side = "both")
## str_trim() is a function from the stringr package
## removes the leading and trailing spaces generated in splitting
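As a quick illustration of what str_trim() does to the split pieces:

```r
library(stringr)

# splitting on "," leaves stray spaces on pieces like " silent"
str_trim(c(" silent", "407 "), side = "both")
## [1] "silent" "407"
```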

By now we are done splitting the strings on the commas. From here we can proceed to creating the frequency table and plotting.

lib <- lib[!lib %in% c("NULL")]
## removes values coded as "NULL"
lib <- data.frame(lib) %>% count(lib) %>% data.frame() %>% arrange(-n)
## counts the frequency of each unique value
lib
##                 lib  n
## 1            silent 77
## 2               407 58
## 3           windows 47
## 4             other 31
## 5  group study room 23
## 6               400 22
## 7               428 17
## 8              desk 14
## 9              sofa 11
## 10       large desk  7
## 11           corner  6
## 12       small desk  4
## 13               4F  2
## 14            table  2
## 15           outlet  1
## 16          privacy  1
## 17          reserve  1
## 18          talking  1

Case II: Strings containing separators within

In the second case, we want to visualize the distribution of reasons for visiting the library by country. The data for “top reasons to visit the library” is in top_reason; the grouping variable is country. Because we designed the survey questions, all possible variable values were known to us.

head(survey$top_reason)
## [1] "Find a quiet place to study,Hang out between classes,Work on a class assignment/paper"  
## [2] "Find a quiet place to study,Hang out between classes,Work on a class assignment/paper"  
## [3] "Meet up with friends,Use a library computer,Print, photocopy, scan"                     
## [4] "Meet up with friends,Print, photocopy, scan,Get readings from Course Reserve"           
## [5] "Find a quiet place to study,Work on a class assignment/paper,Print, photocopy, scan"    
## [6] "Find a quiet place to study,Work on a class assignment/paper,Borrow books and materials"

As we can see, each element contains multiple strings separated by commas. Ideally, we would like one unique value per cell, so, as before, we need to extract the individual strings around the commas. The problem is that this time some strings contain the separator “,” within themselves. For instance, “Print, photocopy, scan” has commas between the verbs, and so does “Use specialized databases (e.g. Bloomberg, Wind)”.

One solution is, as a first step, to create one column for each value of top_reason, matched with country. A cell is filled when the case is true for an individual, that is, when that person selected the reason and comes from the given country.


Step I

Let’s first store all unique values of top_reason in a vector Reason, which will be used as a pattern in extracting the strings.

Reason <- c("Work on a class assignment/paper", "Watch video or listen audio", 
            "Use specialized databases \\(e.g. Bloomberg, Wind\\)", "Use a library computer",  
            "Use a group study room", "Print, photocopy, scan",  "Other", "Meet up with friends" , 
            "Hang out between classes", "Get readings from Course Reserve", "Get help from a librarian", 
            "Find a quiet place to study", "Borrow books and materials", "Attend a library workshop")
Reason
##  [1] "Work on a class assignment/paper"                    
##  [2] "Watch video or listen audio"                         
##  [3] "Use specialized databases \\(e.g. Bloomberg, Wind\\)"
##  [4] "Use a library computer"                              
##  [5] "Use a group study room"                              
##  [6] "Print, photocopy, scan"                              
##  [7] "Other"                                               
##  [8] "Meet up with friends"                                
##  [9] "Hang out between classes"                            
## [10] "Get readings from Course Reserve"                    
## [11] "Get help from a librarian"                           
## [12] "Find a quiet place to study"                         
## [13] "Borrow books and materials"                          
## [14] "Attend a library workshop"

Note that I added \\ before the parentheses in “Use specialized databases (e.g. Bloomberg, Wind)”. A parenthesis is a regex metacharacter with special meaning (grouping), so to match it literally we need to escape it with a double backslash.
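A minimal sketch of the difference the escaping makes, using str_extract() (introduced below):

```r
library(stringr)

s <- "Use specialized databases (e.g. Bloomberg, Wind)"

# unescaped parentheses are treated as a regex group, so the literal "(" never matches
str_extract(s, "Use specialized databases (e.g. Bloomberg, Wind)")
## [1] NA

# escaped parentheses match the literal characters
str_extract(s, "Use specialized databases \\(e.g. Bloomberg, Wind\\)")
## [1] "Use specialized databases (e.g. Bloomberg, Wind)"
```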


Step II

We can now reformat the data to a data frame with each column being a “reason”.

r <- data.frame(survey$country)
## first we create a new data frame
## let the first column be country because we will plot on country subsets

Values of each column will be extracted from the messy strings in the original dataset wherever they match the column name. For instance, if “Find a quiet place to study” appears among an individual’s separated strings, it is placed in that individual’s cell in the column named “Find a quiet place to study”.
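For a single respondent, the matching works like this:

```r
library(stringr)

# str_extract() returns the matched pattern, or NA when there is no match
str_extract("Find a quiet place to study,Hang out between classes",
            "Find a quiet place to study")
## [1] "Find a quiet place to study"

str_extract("Meet up with friends,Use a library computer",
            "Find a quiet place to study")
## [1] NA
```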

for (m in 1:length(Reason)){
  r[,Reason[m]] <- str_extract(survey$top_reason, Reason[m])
}
## length(Reason) counts how many elements the Reason vector contains
## str_extract(string, pattern) is from the stringr package
## r[,Reason[m]] adds each reason as a column to the data frame "r" one by one
## str_extract(survey$top_reason, Reason[m]) extracts the pattern Reason[m] from each element of top_reason

head(r)
##   survey.country Work on a class assignment/paper
## 1           U.S. Work on a class assignment/paper
## 2          China Work on a class assignment/paper
## 3           U.S.                             <NA>
## 4          China                             <NA>
## 5           U.S. Work on a class assignment/paper
## 6          China Work on a class assignment/paper
##   Watch video or listen audio
## 1                        <NA>
## 2                        <NA>
## 3                        <NA>
## 4                        <NA>
## 5                        <NA>
## 6                        <NA>
##   Use specialized databases \\(e.g. Bloomberg, Wind\\)
## 1                                                 <NA>
## 2                                                 <NA>
## 3                                                 <NA>
## 4                                                 <NA>
## 5                                                 <NA>
## 6                                                 <NA>
##   Use a library computer Use a group study room Print, photocopy, scan
## 1                   <NA>                   <NA>                   <NA>
## 2                   <NA>                   <NA>                   <NA>
## 3 Use a library computer                   <NA> Print, photocopy, scan
## 4                   <NA>                   <NA> Print, photocopy, scan
## 5                   <NA>                   <NA> Print, photocopy, scan
## 6                   <NA>                   <NA>                   <NA>
##   Other Meet up with friends Hang out between classes
## 1  <NA>                 <NA> Hang out between classes
## 2  <NA>                 <NA> Hang out between classes
## 3  <NA> Meet up with friends                     <NA>
## 4  <NA> Meet up with friends                     <NA>
## 5  <NA>                 <NA>                     <NA>
## 6  <NA>                 <NA>                     <NA>
##   Get readings from Course Reserve Get help from a librarian
## 1                             <NA>                      <NA>
## 2                             <NA>                      <NA>
## 3                             <NA>                      <NA>
## 4 Get readings from Course Reserve                      <NA>
## 5                             <NA>                      <NA>
## 6                             <NA>                      <NA>
##   Find a quiet place to study Borrow books and materials
## 1 Find a quiet place to study                       <NA>
## 2 Find a quiet place to study                       <NA>
## 3                        <NA>                       <NA>
## 4                        <NA>                       <NA>
## 5 Find a quiet place to study                       <NA>
## 6 Find a quiet place to study Borrow books and materials
##   Attend a library workshop
## 1                      <NA>
## 2                      <NA>
## 3                      <NA>
## 4                      <NA>
## 5                      <NA>
## 6                      <NA>
table(r$`Use specialized databases \\(e.g. Bloomberg, Wind\\)`)
## 
## Use specialized databases (e.g. Bloomberg, Wind) 
##                                               17
## try the earlier steps without the backslash escapes: no values will be carried over, and the frequency here will be 0

Step III

Thinking of our goal again: what we need is a “frequency table” by country, so we need to reshape the data into that structure.

We need a data frame where all the reasons form a single column.

dtset <- data.frame(Reason)
dtset
##                                                  Reason
## 1                      Work on a class assignment/paper
## 2                           Watch video or listen audio
## 3  Use specialized databases \\(e.g. Bloomberg, Wind\\)
## 4                                Use a library computer
## 5                                Use a group study room
## 6                                Print, photocopy, scan
## 7                                                 Other
## 8                                  Meet up with friends
## 9                              Hang out between classes
## 10                     Get readings from Course Reserve
## 11                            Get help from a librarian
## 12                          Find a quiet place to study
## 13                           Borrow books and materials
## 14                            Attend a library workshop
levels(dtset$Reason)[levels(dtset$Reason) == "Use specialized databases \\(e.g. Bloomberg, Wind\\)"] <- "Use specialized databases (e.g. Bloomberg, Wind)"
## reset the factor level of Reason: the escaping has served its purpose, so remove the backslashes
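Note that levels() works here because data.frame() converted Reason to a factor (the pre-R-4.0 stringsAsFactors default). If Reason stays a character vector, a gsub() call achieves the same cleanup; a sketch:

```r
# strip the escaping backslashes from a character value
gsub("\\\\", "", "Use specialized databases \\(e.g. Bloomberg, Wind\\)")
## [1] "Use specialized databases (e.g. Bloomberg, Wind)"
```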

Next we will summarize the distribution by country for each “reason”.

g <- c("China", "U.S.", "Other")

for (n in 1:length(g)){
  dtset[,g[n]] <- apply(r[r$survey.country == g[n], 2:15], 2, function(x) length(which(!is.na(x))))
}
## adds one country column at a time to the data frame
## g[n]: a group value ("China", "U.S.", "Other")
## r[r$survey.country == g[n], 2:15]: the 14 reason columns for respondents from that country
## apply(..., 2, function(x) length(which(!is.na(x)))) counts the non-missing cases in each column
dtset$Total<- rowSums(dtset[,2:4], na.rm = TRUE, dims = 1)
## creates a new column to calculate the total

dtset <- dtset[order(dtset$Total, decreasing = TRUE), ]
## reorder the rows by Total

dtset
##                                              Reason China U.S. Other Total
## 12                      Find a quiet place to study   123   47    50   220
## 6                            Print, photocopy, scan   104   45    52   201
## 1                  Work on a class assignment/paper    89   46    31   166
## 13                       Borrow books and materials    63   15    28   106
## 10                 Get readings from Course Reserve    30    6     8    44
## 4                            Use a library computer    11   20    10    41
## 5                            Use a group study room    32    4     5    41
## 8                              Meet up with friends    23    4     9    36
## 9                          Hang out between classes    16    6     5    27
## 11                        Get help from a librarian    13    2     7    22
## 3  Use specialized databases (e.g. Bloomberg, Wind)    12    3     2    17
## 14                        Attend a library workshop     9    5     3    17
## 2                       Watch video or listen audio     7    2     2    11
## 7                                             Other     2    2     1     5

Done.