Common tasks with string variables include matching a pattern or searching a location in a string, parsing or subsetting a string, concatenating strings, converting a string to a number etc.
These are the three functions that use regular expressions to perform matching. Regular expression is a method that allows for systematic searching, matching and replacing within strings using operators and letters.
Read how Stata’s regular expressions work:
Kevin S. Turner, What are regular expressions and how can I use them in Stata?
regexm(s,re) performs matching on the string s by regular expression re. If the matching is successful, it returns 1; otherwise it returns 0.
For instance, gen flag = regexm(id, "[^0-9 .]") marks observations that contain numeric characters. Then if we want to drop cases with non-numeric characters, we can easily do that.
regexr(s1,re,s2) replaces the string s1 that matches the regular expression re with s2.
. gen id2 = regexr(id, "[^0-9 .]", "marked") generates a new variable id2. Cases with all numeric characters have been replaced by "marked".
regexs(n) must appear after regexm(s,re) and returns the nth substring from what is previously matched by regexm(s,re).
. sysuse auto
. gen make2 = regexs(1) if regexm(make, "([a-zA-Z]+)([ ]|[.])*([a-zA-Z]+)") takes out the car make, the first part from its make and model combination in make strings.
. list make make2 in 5/15
+------------------------+
| make make2 |
|------------------------|
5. | Buick Electra Buick |
6. | Buick LeSabre Buick |
7. | Buick Opel Buick |
8. | Buick Regal Buick |
9. | Buick Riviera Buick |
|------------------------|
10. | Buick Skylark Buick |
11. | Cad. Deville Cad |
12. | Cad. Eldorado Cad |
13. | Cad. Seville Cad |
14. | Chev. Chevette Chev |
|------------------------|
15. | Chev. Impala Chev |
+------------------------+
In this example above:
[] matches at least one of the values in the bracket;
a-zA-Z matches any value of the letters, uppercase or lowercase; 0-9, similarly, will match any numbers in the specified numeric range;
* matches zero or more of the preceding expression;
| is the logical operator “or”.
More examples on the three functions:
UCLA: Statistical Consulting Group, How can I extract a portion of a string variable using regular expressions? | Stata FAQ
strpos(s1, s2) returns the position of s2 where first found in s1;
strrpos(s1, s2) returns the position of s2 where last found in s1;
when positions are not found they return 0.
. sysuse auto
. gen position = strpos(make,".") creates the new variable position that gives us the position of “.” in make.
. list make position in 5/15
+---------------------------+
| make position |
|---------------------------|
5. | Buick Electra 0 |
6. | Buick LeSabre 0 |
7. | Buick Opel 0 |
8. | Buick Regal 0 |
9. | Buick Riviera 0 |
|---------------------------|
10. | Buick Skylark 0 |
11. | Cad. Deville 4 |
12. | Cad. Eldorado 4 |
13. | Cad. Seville 4 |
14. | Chev. Chevette 5 |
|---------------------------|
15. | Chev. Impala 5 |
+---------------------------+
split splits string variables by separators into several components, and generates new string variables for each component taken out from the original string. The default separator is the space.
The parse() option allows one to specify the separator. The generate() option allows one to change the prefix of the new variables to be generated.
. sysuse auto
. split make, p("." " ") gen(substr) splits make by “.” and space into four new variables substr1-substr4.
. list make substr1-substr4 in 10/20
+------------------------------------------------------------+
| make substr1 substr2 substr3 substr4 |
|------------------------------------------------------------|
10. | Buick Skylark Buick Skylark |
11. | Cad. Deville Cad Deville |
12. | Cad. Eldorado Cad Eldorado |
13. | Cad. Seville Cad Seville |
14. | Chev. Chevette Chev Chevette |
|------------------------------------------------------------|
15. | Chev. Impala Chev Impala |
16. | Chev. Malibu Chev Malibu |
17. | Chev. Monte Carlo Chev Monte Carlo |
18. | Chev. Monza Chev Monza |
19. | Chev. Nova Chev Nova |
|------------------------------------------------------------|
20. | Dodge Colt Dodge Colt |
+------------------------------------------------------------+
substr(s, n1, n2) extracts the substring of s from n1 for the length of n2.
Let's say we have a dataset with a string variable time and we want to extract the date and time components to the seconds.
. list time in 1/5
+----------------------------------+
| time |
|----------------------------------|
1. | 2016-03-31T04:20:00.122823+00:00 |
2. | 2016-03-31T04:24:25.447985+00:00 |
3. | 2016-03-31T04:24:25.683670+00:00 |
4. | 2016-03-31T04:24:27.338571+00:00 |
5. | 2016-03-31T04:24:27.552821+00:00 |
+----------------------------------+
. gen datetime = substr(time, 1, 19) generates a new variable datetime with the first 19 characters of time. From there we can convert it to a date and time variable.
. list datetime in 1/5
+---------------------+
| clock |
|---------------------|
1. | 2016-03-31T04:20:00 |
2. | 2016-03-31T04:24:25 |
3. | 2016-03-31T04:24:25 |
4. | 2016-03-31T04:24:27 |
5. | 2016-03-31T04:24:27 |
+---------------------+
subinstr(s1, s2, s3, n) replaces the first n occurrences in s1 of s2 with s3. If n is missing, all occurrences are replaced.
Here we have a dataset with a variable rank:
. tab rank
rank | Freq. Percent Cum.
------------------------------+-----------------------------------
1 | 506 10.44 10.44
1 (1T) | 262 5.40 15.84
1(1T) | 122 2.52 18.36
1(1T) | 72 1.49 19.84
2 | 509 10.50 30.34
2 (2T) | 262 5.40 35.75
2(2T) | 122 2.52 38.26
(2T) | 72 1.49 39.75
3 | 503 10.38 50.12
3 (3T) | 260 5.36 55.49
3(3T) | 121 2.50 57.98
(3T) | 72 1.49 59.47
4 | 442 9.12 68.58
4 (RU) | 228 4.70 73.29
4(RU) | 107 2.21 75.50
4(RU) | 63 1.30 76.79
5 | 324 6.68 83.48
5 (RU) | 162 3.34 86.82
5(RU) | 82 1.69 88.51
5(RU) | 46 0.95 89.46
......
We want to remove all the spaces, (1T), (2T), (3T) and (RU). Here is what we would do:
local n (1T) (2T) (3T) (RU) " "
foreach x in `n'{
replace rank = subinstr(rank,"`x'", "",.)
}
. tab rank
rank | Freq. Percent Cum.
-------------+-----------------------------------
1 | 962 19.84 19.84
2 | 965 19.91 39.75
3 | 956 19.72 59.47
4 | 840 17.33 76.79
5 | 614 12.67 89.46
6 | 333 6.87 96.33
7 | 127 2.62 98.95
8 | 44 0.91 99.86
9 | 7 0.14 100.00
-------------+-----------------------------------
Total | 4,848 100.00
egen newvar = ends() takes out whatever precedes the first space in the string, or the entire string if the string variable does not contain a space. Space is the default separator.
. egen make3 = ends(make) takes out the car make from the combination of make and model by the space between the two.
. list make make3 in 5/15
+------------------------+
| make make3 |
|------------------------|
5. | Buick Electra Buick |
6. | Buick LeSabre Buick |
7. | Buick Opel Buick |
8. | Buick Regal Buick |
9. | Buick Riviera Buick |
|------------------------|
10. | Buick Skylark Buick |
11. | Cad. Deville Cad. |
12. | Cad. Eldorado Cad. |
13. | Cad. Seville Cad. |
14. | Chev. Chevette Chev. |
|------------------------|
15. | Chev. Impala Chev. |
+------------------------+
The punct() option allows one to change where to parse the substring; the default is to parse on the space.
. egen make4 = ends(make), punct(.) takes out either the portion precedes the “.”, or the entire string without “.”.
. list make make4 in 5/15
+--------------------------------+
| make make4 |
|--------------------------------|
5. | Buick Electra Buick Electra |
6. | Buick LeSabre Buick LeSabre |
7. | Buick Opel Buick Opel |
8. | Buick Regal Buick Regal |
9. | Buick Riviera Buick Riviera |
|--------------------------------|
10. | Buick Skylark Buick Skylark |
11. | Cad. Deville Cad |
12. | Cad. Eldorado Cad |
13. | Cad. Seville Cad |
14. | Chev. Chevette Chev |
|--------------------------------|
15. | Chev. Impala Chev |
+--------------------------------+
The punct() trim head|last|tail option further allows one to choose the portion of the string to take out: head, the first substring; last, the last substring; or tail, the remaining substring following the first parsing character.
. egen make5 = ends(make), trim last parses out the last portion from make.
. list make make5 in 5/15
+---------------------------+
| make make5 |
|---------------------------|
5. | Buick Electra Electra |
6. | Buick LeSabre LeSabre |
7. | Buick Opel Opel |
8. | Buick Regal Regal |
9. | Buick Riviera Riviera |
|---------------------------|
10. | Buick Skylark Skylark |
11. | Cad. Deville Deville |
12. | Cad. Eldorado Eldorado |
13. | Cad. Seville Seville |
14. | Chev. Chevette Chevette |
|---------------------------|
15. | Chev. Impala Impala |
+---------------------------+
The four functions trim the strings by removing the spaces.
strtrim(s) removes the leading or trailing spaces.
e.g. strtrim(“ nyush ”) = “nyush”
stritrim(s) removes the multiple internal spaces.
e.g. stritrim(“nyu     sh”) = “nyu sh”
strltrim(s) removes the leading spaces.
e.g. strltrim(“ nyush”) = “nyush”
strrtrim(s) removes the trailing spaces.
e.g. strrtrim(“nyush ”) = “nyush”
string(n) and real(s) are two string functions that convert numeric/string to string/numeric variables. Note that real()/string() are functions and must be used in conjunction with a Stata command.
string(n) is a synonym for strofreal(n) and converts numeric or missing values to strings.
. sysuse auto
. gen rep78_str = string(rep78) converts the numeric and missing values of rep78 to strings.
real(s) converts strings to numeric or missing values. The string variable must contain number characters, otherwise missing values will be generated.
. gen rep78_str_num = real(rep78_str) converts the string variable rep78_str created above by string(n) back to numeric again.
. gen make_real = real(make) will create the new variable make_real with all missing values.
egen newvar, concat(varlist) concatenates a variable list and generates a new string variable.
. sysuse auto
. egen make_foreign = concat(foreign make)
Or simply use + to concatenate several strings when the variables are already strings.
. gen make_foreign2 = string(foreign) + make