Missing Values

Numeric variables

Missing values of numeric variables are denoted by a period (.), the system missing value. Additionally, Stata provides 26 extended missing values: .a, .b, …, and .z.

Missing values are larger than any number

The order is: all nonmissing numbers < . < .a < .b < … < .z. Arithmetic expressions involving missing values evaluate to missing; for example, 2 + . is missing.
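
The ordering can be checked directly with display. This is a minimal sketch; the value 1e300 is only an arbitrary very large number chosen for illustration.

```stata
* Relational expressions return 1 (true) or 0 (false)
* A very large nonmissing number is still smaller than missing
display 1e300 < .
* The system missing value sorts below the extended missing values
display . < .a
display .a < .z
* This is why a condition such as rep78 > 4 is true when rep78 is missing
display . > 4
* Arithmetic on a missing value gives a missing value
display 2 + .
```

The first four commands should each print 1, and the last should print a period.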

We should be careful when working with data containing missing values.
. sysuse auto
. list make rep78 if rep78 > 4
This returns both the cars whose repair record is greater than 4 and the cars whose repair record is missing, because a missing value compares as larger than any number.


     +------------------------+
     | make             rep78 |
     |------------------------|
  3. | AMC Spirit           . |
  7. | Buick Opel           . |
 20. | Dodge Colt           5 |
 43. | Plym. Champ          5 |
 45. | Plym. Sapporo        . |
     |------------------------|
 51. | Pont. Phoenix        . |
 53. | Audi 5000            5 |
 57. | Datsun 210           5 |
 61. | Honda Accord         5 |
 64. | Peugeot 604          . |
     |------------------------|
 66. | Subaru               5 |
 67. | Toyota Celica        5 |
 68. | Toyota Corolla       5 |
 69. | Toyota Corona        5 |
 71. | VW Diesel            5 |
     |------------------------|
 74. | Volvo 260            5 |
     +------------------------+

		
If we only want the cars that actually have a repair record, we should also exclude the missing values:
. list make rep78 if rep78 > 4 & !missing(rep78)

     +------------------------+
     | make             rep78 |
     |------------------------|
 20. | Dodge Colt           5 |
 43. | Plym. Champ          5 |
 53. | Audi 5000            5 |
 57. | Datsun 210           5 |
 61. | Honda Accord         5 |
     |------------------------|
 66. | Subaru               5 |
 67. | Toyota Celica        5 |
 68. | Toyota Corolla       5 |
 69. | Toyota Corona        5 |
 71. | VW Diesel            5 |
     |------------------------|
 74. | Volvo 260            5 |
     +------------------------+
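
A quick way to see the size of the difference is to count the observations under each condition. The sketch below uses the same auto data; the third command is an equivalent way to write the condition, relying on the ordering of missing values described earlier.

```stata
sysuse auto, clear
* Includes the missing values: 16 observations, as in the first listing
count if rep78 > 4
* Excludes the missing values: 11 observations, as in the second listing
count if rep78 > 4 & !missing(rep78)
* Same condition written as a range; every nonmissing number is below .
count if rep78 > 4 & rep78 < .
```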

		

String variables

Missing values of string variables are denoted by the empty string "", with no space between the quotation marks.
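
As a small illustration, the sketch below builds a three-observation toy dataset (the variable name and its values are made up for this example) and shows that only the empty string counts as missing; a string holding a single space does not.

```stata
clear
set obs 3
* Hypothetical string variable for illustration
generate str10 name = "Ann"
* Observation 2: empty string, which is the string missing value
replace name = "" in 2
* Observation 3: a single space, which is NOT missing
replace name = " " in 3
* Both commands should count exactly one observation
count if name == ""
count if missing(name)
```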

Identifying missing values

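To check for missing values, we can use either of the following conditions:
. if var >= .
. if missing(var)
Use >= rather than > in the first form: var > . is false for the system missing value (.) itself and is true only for .a through .z.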

missing(x, y, z) returns 1 (true) if any of x, y, or z is missing, and 0 otherwise. It accepts both numeric and string arguments.
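
The following sketch applies missing() to the auto data. The variable name complete is a hypothetical name introduced here; misstable summarize is a built-in command that reports how many values are missing in each listed variable.

```stata
sysuse auto, clear
* Observations with a missing repair record (5 in the listing above)
count if missing(rep78)
* With several arguments, missing() is true if ANY of them is missing
count if missing(price, mpg, rep78)
* Flag complete cases on these three variables (hypothetical variable name)
generate byte complete = !missing(price, mpg, rep78)
tabulate complete
* Summary of missingness for the listed variables
misstable summarize price mpg rep78
```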

Be careful with the missing value indicators

Sometimes it can be tricky to identify missing values. For instance, survey data often code missing values with special numbers: -999 (not applicable), 88,888,888,888 (don't know), 99,999,999,999 (refused to answer), etc.

Make sure you read the codebook before you clean the data. If you treat extremely large or small missing-value codes as real observations, the results can be severely biased. Imagine an income variable that uses 99,999,999,999 as a missing-value code, and an analyst who carelessly includes those cases in a model. In reality the respondent refused to report an income, but in the model that person becomes extremely rich.
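
One way to handle such codes is to convert them to extended missing values with mvdecode, so that the reason for missingness is kept but the codes no longer enter any calculation. The sketch below uses a made-up five-observation dataset; the variable income, its values, and the mapping of codes to .a, .b, and .c are assumptions for illustration and should follow the actual codebook. The variable is stored as double because a float cannot hold numbers as large as 99,999,999,999 exactly.

```stata
clear
set obs 5
* Hypothetical income variable; double keeps the large codes exact
generate double income = 52000
replace income = -999 in 2
replace income = 88888888888 in 3
replace income = 99999999999 in 4
* Before recoding: the mean is dominated by the missing-value codes
summarize income
* Map each code to its own extended missing value
mvdecode income, mv(-999=.a \ 88888888888=.b \ 99999999999=.c)
* After recoding: only the two real incomes are used
summarize income
* The missing option shows how many observations fall under each code
tabulate income, missing
```

Using a separate extended missing value for each code keeps "not applicable", "don't know", and "refused" distinguishable later, which a single . would not.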

More on how Stata handles missing values in various scenarios:
UCLA: Statistical Consulting Group, Missing Values | Stata Learning Modules

Author: Yun Dai, 2018