Strings

Common tasks with string variables include matching a pattern or searching a location in a string, parsing or subsetting a string, concatenating strings, converting a string to a number etc.

A full list of string functions.

Matching and searching

regexm(); regexr(); regexs()

These are the three functions that use regular expressions to perform matching. Regular expression is a method that allows for systematic searching, matching and replacing within strings using operators and letters.

Read how Stata’s regular expressions work:
Kevin S. Turner, What are regular expressions and how can I use them in Stata?

regexm()

regexm(s,re) performs matching on the string s by regular expression re. If the matching is successful, it returns 1; otherwise it returns 0.

For instance, gen flag = regexm(id, "[^0-9 .]") marks observations that contain numeric characters. Then if we want to drop cases with non-numeric characters, we can easily do that.

regexr()

regexr(s1,re,s2) replaces the string s1 that matches the regular expression re with s2.

. gen id2 = regexr(id, "[^0-9 .]", "marked") generates a new variable id2. Cases with all numeric characters have been replaced by "marked".

regexs()

regexs(n) must appear after regexm(s,re) and returns the n^th substring from what is previously matched by regexm(s,re).

. sysuse auto
. gen make2 = regexs(1) if regexm(make, "([a-zA-Z]+)([ ]|[.])*([a-zA-Z]+)") takes out the car make, the first part from its make and model combination in make strings.

. list make make2 in 5/15


     +------------------------+
     | make             make2 |
     |------------------------|
  5. | Buick Electra    Buick |
  6. | Buick LeSabre    Buick |
  7. | Buick Opel       Buick |
  8. | Buick Regal      Buick |
  9. | Buick Riviera    Buick |
     |------------------------|
 10. | Buick Skylark    Buick |
 11. | Cad. Deville       Cad |
 12. | Cad. Eldorado      Cad |
 13. | Cad. Seville       Cad |
 14. | Chev. Chevette    Chev |
     |------------------------|
 15. | Chev. Impala      Chev |
     +------------------------+

In this example above:
[] matches at least one of the values in the bracket;
a-zA-Z matches any value of the letters, uppercase or lowercase; 0-9, similarly, will match any numbers in the specified numeric range;
* matches zero or more of the preceding expression;
| is the logical operator “or”.

More examples on the three functions:
UCLA: Statistical Consulting Group, How can I extract a portion of a string variable using regular expressions? | Stata FAQ

strpos(); strrpos()

strpos(s1, s2) returns the position of s2 where first found in s1;
strrpos(s1, s2) returns the position of s2 where last found in s1；
when positions are not found they return 0.

. sysuse auto
. gen position = strpos(make,".") creates the new variable position that gives us the position of “.” in make.

. list make position in 5/15


     +---------------------------+
     | make             position |
     |---------------------------|
  5. | Buick Electra           0 |
  6. | Buick LeSabre           0 |
  7. | Buick Opel              0 |
  8. | Buick Regal             0 |
  9. | Buick Riviera           0 |
     |---------------------------|
 10. | Buick Skylark           0 |
 11. | Cad. Deville            4 |
 12. | Cad. Eldorado           4 |
 13. | Cad. Seville            4 |
 14. | Chev. Chevette          5 |
     |---------------------------|
 15. | Chev. Impala            5 |
     +---------------------------+

Parsing and extracting

split

split splits string variables by separators into several components, and generates new string variables for each component taken out from the original string. The default separator is the space.

The parse() option allows one to specify the separator. The generate() option allows one to change the prefix of the new variables to be generated.

. sysuse auto
. split make, p("." " ") gen(substr) splits make by “.” and space into four new variables substr1-substr4.

. list make substr1-substr4 in 10/20


     +------------------------------------------------------------+
     | make                substr1   substr2    substr3   substr4 |
     |------------------------------------------------------------|
 10. | Buick Skylark         Buick   Skylark                      |
 11. | Cad. Deville            Cad              Deville           |
 12. | Cad. Eldorado           Cad             Eldorado           |
 13. | Cad. Seville            Cad              Seville           |
 14. | Chev. Chevette         Chev             Chevette           |
     |------------------------------------------------------------|
 15. | Chev. Impala           Chev               Impala           |
 16. | Chev. Malibu           Chev               Malibu           |
 17. | Chev. Monte Carlo      Chev                Monte     Carlo |
 18. | Chev. Monza            Chev                Monza           |
 19. | Chev. Nova             Chev                 Nova           |
     |------------------------------------------------------------|
 20. | Dodge Colt            Dodge      Colt                      |
     +------------------------------------------------------------+

substr()

substr(s, n1, n2) extracts the substring of s from n1 for the length of n2.

Let's say we have a dataset with a string variable time and we want to extract the date and time components to the seconds.
. list time in 1/5


     +----------------------------------+
     |                             time |
     |----------------------------------|
  1. | 2016-03-31T04:20:00.122823+00:00 |
  2. | 2016-03-31T04:24:25.447985+00:00 |
  3. | 2016-03-31T04:24:25.683670+00:00 |
  4. | 2016-03-31T04:24:27.338571+00:00 |
  5. | 2016-03-31T04:24:27.552821+00:00 |
     +----------------------------------+

. gen datetime = substr(time, 1, 19) generates a new variable datetime with the first 19 characters of time. From there we can convert it to a date and time variable.
. list datetime in 1/5


     +---------------------+
     |               clock |
     |---------------------|
  1. | 2016-03-31T04:20:00 |
  2. | 2016-03-31T04:24:25 |
  3. | 2016-03-31T04:24:25 |
  4. | 2016-03-31T04:24:27 |
  5. | 2016-03-31T04:24:27 |
     +---------------------+

subinstr()

subinstr(s1, s2, s3, n) replaces the first n occurrences in s1 of s2 with s3. If n is missing, all occurrences are replaced.

Here we have a dataset with a variable rank:
. tab rank


               rank |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
                  1 |        506       10.44       10.44
             1 (1T) |        262        5.40       15.84
              1(1T) |        122        2.52       18.36
            1(1T)   |         72        1.49       19.84
                  2 |        509       10.50       30.34
             2 (2T) |        262        5.40       35.75
              2(2T) |        122        2.52       38.26
             (2T)   |         72        1.49       39.75
                  3 |        503       10.38       50.12
             3 (3T) |        260        5.36       55.49
              3(3T) |        121        2.50       57.98
             (3T)   |         72        1.49       59.47
                  4 |        442        9.12       68.58
             4 (RU) |        228        4.70       73.29
              4(RU) |        107        2.21       75.50
             4(RU)  |         63        1.30       76.79
                  5 |        324        6.68       83.48
             5 (RU) |        162        3.34       86.82
              5(RU) |         82        1.69       88.51
            5(RU)   |         46        0.95       89.46

      ......

We want to remove all the spaces, (1T), (2T), (3T) and (RU). Here is what we would do:


local n (1T) (2T) (3T) (RU) " "
foreach x in `n'{
	replace rank = subinstr(rank,"`x'", "",.) 
}

. tab rank


        rank |      Freq.     Percent        Cum.
-------------+-----------------------------------
           1 |        962       19.84       19.84
           2 |        965       19.91       39.75
           3 |        956       19.72       59.47
           4 |        840       17.33       76.79
           5 |        614       12.67       89.46
           6 |        333        6.87       96.33
           7 |        127        2.62       98.95
           8 |         44        0.91       99.86
           9 |          7        0.14      100.00
-------------+-----------------------------------
       Total |      4,848      100.00

egen newvar = ends()

egen newvar = ends() takes out whatever precedes the first space in the string, or the entire string if the string variable does not contain a space. Space is the default separator.

. egen make3 = ends(make) takes out the car make from the combination of make and model by the space between the two.
. list make make3 in 5/15


     +------------------------+
     | make             make3 |
     |------------------------|
  5. | Buick Electra    Buick |
  6. | Buick LeSabre    Buick |
  7. | Buick Opel       Buick |
  8. | Buick Regal      Buick |
  9. | Buick Riviera    Buick |
     |------------------------|
 10. | Buick Skylark    Buick |
 11. | Cad. Deville      Cad. |
 12. | Cad. Eldorado     Cad. |
 13. | Cad. Seville      Cad. |
 14. | Chev. Chevette   Chev. |
     |------------------------|
 15. | Chev. Impala     Chev. |
     +------------------------+

The punct() option allows one to change where to parse the substring; the default is to parse on the space.
. egen make4 = ends(make), punct(.) takes out either the portion precedes the “.”, or the entire string without “.”.
. list make make4 in 5/15


     +--------------------------------+
     | make                     make4 |
     |--------------------------------|
  5. | Buick Electra    Buick Electra |
  6. | Buick LeSabre    Buick LeSabre |
  7. | Buick Opel          Buick Opel |
  8. | Buick Regal        Buick Regal |
  9. | Buick Riviera    Buick Riviera |
     |--------------------------------|
 10. | Buick Skylark    Buick Skylark |
 11. | Cad. Deville               Cad |
 12. | Cad. Eldorado              Cad |
 13. | Cad. Seville               Cad |
 14. | Chev. Chevette            Chev |
     |--------------------------------|
 15. | Chev. Impala              Chev |
     +--------------------------------+

The punct() trim head|last|tail option further allows one to choose the portion of the string to take out: head, the first substring; last, the last substring; or tail, the remaining substring following the first parsing character.
. egen make5 = ends(make), trim last parses out the last portion from make.
. list make make5 in 5/15


     +---------------------------+
     | make                make5 |
     |---------------------------|
  5. | Buick Electra     Electra |
  6. | Buick LeSabre     LeSabre |
  7. | Buick Opel           Opel |
  8. | Buick Regal         Regal |
  9. | Buick Riviera     Riviera |
     |---------------------------|
 10. | Buick Skylark     Skylark |
 11. | Cad. Deville      Deville |
 12. | Cad. Eldorado    Eldorado |
 13. | Cad. Seville      Seville |
 14. | Chev. Chevette   Chevette |
     |---------------------------|
 15. | Chev. Impala       Impala |
     +---------------------------+

strtrim(); stritrim();strltrim();strrtrim()

The four functions trim the strings by removing the spaces.

strtrim(s) removes the leading or trailing spaces.
e.g. strtrim(“ nyush ”) = “nyush”

stritrim(s) removes the multiple internal spaces.
e.g. stritrim(“nyu sh”) = “nyu sh”

strltrim(s) removes the leading spaces.
e.g. strltrim(“ nyush”) = “nyush”

strrtrim(s) removes the trailing spaces.
e.g. strrtrim(“nyush ”) = “nyush”

Conversion

string(n) and real(s) are two string functions that convert numeric/string to string/numeric variables. Note that real()/string() are functions and must be used in conjunction with a Stata command.

string()

string(n) is a synonym for strofreal(n) and converts numeric or missing values to strings.
. sysuse auto
. gen rep78_str = string(rep78) converts the numeric and missing values of rep78 to strings.

real()

real(s) converts strings to numeric or missing values. The string variable must contain number characters, otherwise missing values will be generated.
. gen rep78_str_num = real(rep78_str) converts the string variable rep78_str created above by string(n) back to numeric again.
. gen make_real = real(make) will create the new variable make_real with all missing values.

Concatenation

egen newvar, concat(varlist) concatenates a variable list and generates a new string variable.
. sysuse auto
. egen make_foreign = concat(foreign make)

Or simply use + to concatenate several strings when the variables are already strings.
. gen make_foreign2 = string(foreign) + make