The five commands below are often used to create or modify variables.
generate newvar = exp creates a new variable from an expression, typically one built from existing variables.
. sysuse auto
. gen price1 = price/2.5
replace changes the contents of an existing variable. Combined with generate, it can be used to build a recoded copy of an existing numeric variable.
. gen price2 = price
. replace price2 = price*0.8 if foreign == 1
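recode var (rule)..., generate(newvar) recodes the values of a numeric variable according to the rules, usually to create a categorical variable. With generate(), the result is stored in newvar and var itself is left unchanged.
. sysuse auto, clear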
. recode rep78 (min/2 = 1 good) (3 = 2 fair) (4 5 = 3 poor) (missing = .), gen(rep78_scale) label(repair_record_scale)
Some notes for recode:
Ranges specified inside the () include both boundaries.
Numbers to the left of = are values to be recoded, while the number following = is the new value to be assigned.
/ means “through”.
min refers to the smallest value; max refers to the largest value.
missing refers to all missing values; nonmissing to all nonmissing values; and else to all values, missing or nonmissing, that are not matched by a preceding rule.
Values that are not matched by any rule are copied to the new variable as they are.
The text following the assigned value is its value label (e.g. good).
label() gives a name to the new value label.
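As a small sketch of the last few rules, the two commands below (continuing with the auto data; the names rep_a and rep_b are made up for illustration) differ only in the else rule:
. recode rep78 (1/2 = 1), gen(rep_a)
. recode rep78 (1/2 = 1) (else = 9), gen(rep_b)
In rep_a, the unmatched values 3, 4, 5 and missing are copied as they are. In rep_b, else catches all of them, so both 3 through 5 and the missing values become 9.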
To check that a transformation worked as intended, it is always a good idea to cross-tabulate the newly created variable against the variable it was created from.
. tab rep78 rep78_scale
    Repair |  RECODE of rep78 (Repair Record 1978)
    Record |
      1978 |      good       fair       poor |     Total
-----------+---------------------------------+----------
         1 |         2          0          0 |         2
         2 |         8          0          0 |         8
         3 |         0         30          0 |        30
         4 |         0          0         18 |        18
         5 |         0          0         11 |        11
-----------+---------------------------------+----------
     Total |        10         30         29 |        69
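The table above leaves out the observations where rep78 is missing. One way to confirm that the (missing = .) rule kept them missing, assuming rep78_scale was created as above, is to add the missing option to tabulate:
. tab rep78 rep78_scale, missing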
egen is the extended generate. It requires a function to be specified to generate a new variable: egen newvar = function().
Functions include mean(), sd(), min(), max(), rowmean(), diff(), total(), std(), group() etc. Type help egen to view a complete list and descriptions of the functions that go with egen.
Below are some common uses of egen.
. sysuse auto, clear
. egen total_weight = total(weight), by(foreign)
creates the total car weight by car type.
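Note that egen newvar = total() treats missing values as 0, so observations with a missing value still receive the group total, and a group with no nonmissing values gets 0 rather than missing. To compute the total only for observations with nonmissing values, leaving the others missing, add an if qualifier (the new variable needs a name that is not already in use):
. egen total_weight2 = total(weight) if !missing(weight), by(foreign)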
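Since weight has no missing values in the auto data, the difference is easier to see with rep78, which does. A minimal sketch, with hypothetical variable names rep_total and rep_total2:
. egen rep_total = total(rep78), by(foreign)
. egen rep_total2 = total(rep78) if !missing(rep78), by(foreign)
. list make rep78 rep_total rep_total2 if missing(rep78)
For the listed cars, rep_total should hold the group total while rep_total2 stays missing.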
. egen car_space = rowmean(headroom length)
creates an arbitrary measure of car space as the mean of headroom and car length.
Note that if one of the two variables headroom and length is missing for an observation, egen newvar = rowmean() ignores the missing value and uses the nonmissing one. If both are missing, egen newvar = rowmean() returns a missing value. In this example neither variable contains missing values.
Compare this method with the generate method:
. gen car_space2 = (headroom + length)/2
Here, if either variable is missing for an observation, generate returns a missing value for that observation.
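Because headroom and length are complete, the two methods agree here. To see them differ, one could pair headroom with rep78, which has missing values (a purely illustrative sketch; m1 and m2 are made-up names and the average itself has no substantive meaning):
. egen m1 = rowmean(rep78 headroom)
. gen m2 = (rep78 + headroom)/2
. list make rep78 headroom m1 m2 if missing(rep78)
In the listed rows, m1 should equal headroom, the only nonmissing value, while m2 is missing.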
egen group_id = group(old_group_var) creates a new group id with numeric values for the categorical variable.
. sysuse citytemp, clear
. egen region_id = group(region)
. tostring region_id, replace
generates a new group id with values from 1 to 4 for the categorical variable region and then converts the id variable to a string.
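egen newvar = cut(var), at(#,#,…,#) provides one more method of recoding a numeric variable into a categorical one. The #s are the cut-offs, in ascending order; each interval includes its left endpoint and excludes its right endpoint, and values outside the range of the cut-offs are set to missing.
. sysuse auto, clear
. egen price3 = cut(price), at(3291,5000,15906)
recodes price into price3 with two intervals, [3291,5000) and [5000,15906). Because the right end of the last interval is exclusive, the maximum price of 15,906 falls outside the intervals and is set to missing.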
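A quick check of the result, assuming price3 was created from the auto data as above, is to tabulate it with the missing option:
. tab price3, missing
By default cut() codes each interval by its left endpoint, so price3 should take the values 3291 and 5000, with any car priced at 15,906 shown as missing. If consecutive integer codes are preferred, cut() also accepts the icodes option.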
Alternatively, egen newvar = cut(var), group(#) divides var into # groups of approximately equal frequencies.
. egen price4 = cut(price), group(5)
divides price into 5 groups of roughly equal size and stores the group codes in price4.
Another way to convert numeric variables to categorical variables is to use autocode(), the automated version of the recode() function.
generate newvar = autocode(var, n, a, b) divides the range from a to b into n equal-length intervals and sets newvar to the upper bound of the interval that contains var; each interval includes its upper bound.
. sysuse auto, clear
. sort price
. gen price5 = autocode(price,3,price[1],price[_N])
recodes the numeric variable price into a variable with three categories spanning its minimum to its maximum value. price[1] and price[_N] use explicit subscripting to refer to the first and last observations, which after sort price hold the minimum and maximum prices.
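With the minimum 3291 and maximum 15906 used above, each of the three intervals is (15906 - 3291)/3 = 4205 wide, so price5 should take the upper bounds 7496, 11701 and 15906. A sketch of one way to verify this, assuming price5 was created as above:
. tabstat price, by(price5) statistics(min max n)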
Four methods of transforming numeric to categorical variables that we have come across so far: