Missing values of numeric variables are denoted by a period (.), the system missing value. Stata also provides 26 extended missing values: .a, .b, …, and .z.
Missing values sort above every number, in the order: all numbers < . < .a < .b < … < .z. Arithmetic expressions involving missing values evaluate to missing.
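The ordering and the propagation rule can be checked directly. The lines below are a minimal sketch meant to be run as a do-file; the specific numbers (1e300, 2, 10) are arbitrary illustrations, not values from any dataset.

```stata
* Missing values compare as larger than any number,
* and extended missing values compare as larger than "."
assert 1e300 < .
assert . < .a
assert .a < .z

* Arithmetic with a missing operand yields a missing result
display 2 + .
assert missing(2 + .)
assert missing(.a * 10)
```

Each `assert` passes silently when its condition is true, so a clean run confirms the rules.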
Because missing values compare as larger than any number, we should be careful when working with data containing missing values. For example:
. sysuse auto
. list make rep78 if rep78 > 4
returns both the car models whose repair record is greater than 4 and those whose repair record is missing:
     +------------------------+
     | make             rep78 |
     |------------------------|
  3. | AMC Spirit           . |
  7. | Buick Opel           . |
 20. | Dodge Colt           5 |
 43. | Plym. Champ          5 |
 45. | Plym. Sapporo        . |
     |------------------------|
 51. | Pont. Phoenix        . |
 53. | Audi 5000            5 |
 57. | Datsun 210           5 |
 61. | Honda Accord         5 |
 64. | Peugeot 604          . |
     |------------------------|
 66. | Subaru               5 |
 67. | Toyota Celica        5 |
 68. | Toyota Corolla       5 |
 69. | Toyota Corona        5 |
 71. | VW Diesel            5 |
     |------------------------|
 74. | Volvo 260            5 |
     +------------------------+
If we only want those with a record, we should also exclude the missing values, for example:
. list make rep78 if rep78 > 4 & rep78 < .
     +------------------------+
     | make             rep78 |
     |------------------------|
 20. | Dodge Colt           5 |
 43. | Plym. Champ          5 |
 53. | Audi 5000            5 |
 57. | Datsun 210           5 |
 61. | Honda Accord         5 |
     |------------------------|
 66. | Subaru               5 |
 67. | Toyota Celica        5 |
 68. | Toyota Corolla       5 |
 69. | Toyota Corona        5 |
 71. | VW Diesel            5 |
     |------------------------|
 74. | Volvo 260            5 |
     +------------------------+
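A quick way to see how many observations a careless filter picks up is to count instead of list. The sketch below assumes the auto dataset shipped with Stata and is meant to be run as a do-file; the expected counts are taken from the two listings above (16 rows and 11 rows).

```stata
sysuse auto, clear

* Naive filter: missing values compare as larger than 4, so they are counted too
count if rep78 > 4
assert r(N) == 16

* Excluding missing values leaves only the cars with a recorded value of 5
count if rep78 > 4 & !missing(rep78)
assert r(N) == 11

* The difference is the number of cars with no repair record
count if missing(rep78)
assert r(N) == 5
```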
Missing values of string variables are denoted by the empty string "" (two double quotes with nothing, not even a space, between them).
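The distinction between an empty string and a string containing a space can be illustrated with a small made-up dataset. The variable `name` and its values are hypothetical, chosen only for this sketch.

```stata
clear
set obs 3
generate str5 name = "Ann"
replace name = "" in 2
replace name = " " in 3

* Only the empty string is treated as missing
count if name == ""
count if missing(name)

* A string holding just a space is not missing; trim it first to catch blanks
count if trim(name) == ""
```

If blank-looking values should also count as missing, trimming before the comparison is one simple option.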
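To check for missing values (system or extended) we can use either of the following conditions:
. if var >= .
. if missing(var)
Note that var > . would match only the extended missing values .a to .z and skip the system missing value.
missing(x, y, z) is true if any of x, y, or z is missing.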
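As a minimal sketch of these checks, again assuming the auto dataset shipped with Stata, the comparison `>= .` (which covers both the system and the extended missing values) and the `missing()` function can be used interchangeably on a numeric variable. The choice of `rep78`, `price`, and `mpg` is only for illustration.

```stata
sysuse auto, clear

* These two conditions select the same observations
count if rep78 >= .
count if missing(rep78)

* True when at least one of the listed variables is missing
list make rep78 price mpg if missing(rep78, price, mpg)

* Complete cases across the same variables
count if !missing(rep78, price, mpg)
```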
Sometimes it can be tricky to identify missing values. For instance, survey data often code missing values with special numbers: -999 (not applicable), 88,888,888,888 (don't know), 99,999,999,999 (refused to answer), etc.
Make sure you read the codebook before you clean the data. If you treat these extremely large or small codes as real values, the results can be severely biased. Imagine an income variable that uses 99,999,999,999 for refusals, and an analyst who carelessly includes those cases in a model. In reality the respondents refused to report their income, but in the model they appear to be extremely rich.
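One way to handle such codes is to convert them to extended missing values with Stata's `mvdecode` command before any analysis. The sketch below uses a made-up `income` variable with invented values, and the mapping of codes to .a, .b, and .c is an assumption for illustration; the real codes should always come from the codebook. The variable is stored as double because a float cannot hold numbers this large exactly.

```stata
clear
input double income
52000
-999
88888888888
99999999999
61000
end

* Before recoding, the special codes dominate the mean
summarize income

* Map each survey code to its own extended missing value
mvdecode income, mv(-999 = .a \ 88888888888 = .b \ 99999999999 = .c)

* After recoding, only the two real incomes are used
summarize income
assert r(N) == 2
```

Using a different extended missing value for each code keeps the reason for missingness available for later tabulation.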
More on how Stata handles missing values in various scenarios:
UCLA: Statistical Consulting Group, Missing Values | Stata Learning Modules