Hopefully this post will be preaching to the choir, but lately I’ve been noticing that many students, programmers, professors, etc. write their statistical code as though there were no such thing as missing data. Unless your data have been cleaned and the missing values imputed (I think the Public Use Census Data are like this), overlooking missing data is a mistake that can lead to, inter alia, biased estimates, much as measurement error does.

The most common error that I see occurs in the creation of new variables. Variables are created according to rules regarding other variables (e.g., if x>5, set y=1). But what happens if x is a missing value? Do you know how your statistical software treats missing values? I’m going to go through this example using Stata, but the same logic applies to any programming environment.

I have created a random variable c that takes 3 values (1, 2 and 3). There are 3 missing observations for the variable c. I want to generate a new variable that takes the form

\text{NewVar}_i = \begin{cases} 1 & \text{if } c_i \in \{1,2\} \\ 0 & \text{if } c_i = 3 \end{cases}

Consider the following two approaches to creating this variable:

Approach 1: Initialize everything to zero

. gen x=0
. replace x=1 if (c==1 | c==2)
(9 real changes made)

Approach 2: Initialize everything to missing

. gen y=.
(25 missing values generated)
. replace y=1 if (c==1 | c==2)
(9 real changes made)
. replace y=0 if c==3
(13 real changes made)

Results

The variable x, created using Approach 1, has too many nonmissing observations. There should be 22 (25 observations minus the 3 where c is missing), but there are 25:

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |        25         .36    .4898979          0          1
           y |        22    .4090909    .5032363          0          1

The variable y, on the other hand, has the correct number. The difference arises because what I want is to define NewVar to be 0 whenever c is 3. Approach 1 instead defines NewVar to be 0 whenever c is not in {1,2}, which includes the observations where c is missing. This is a big difference from an econometric/statistical perspective. If we define NewVar to be 0 when we do not know the value of c, we are essentially imputing values. These imputed values will reduce the accuracy of estimates and future calculations.
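To see exactly where the missing values end up, it can help to cross-tabulate the new variables against c with the missing option. Below is a minimal sketch on a made-up dataset (the deterministic rule for c and the variable z are my own illustrative assumptions, not the data behind the table above, so the counts differ). It also shows a one-line form of Approach 2.

```stata
* Hypothetical toy data: 25 observations, c in {1,2,3}, first 3 set to missing
clear
set obs 25
gen c = mod(_n, 3) + 1
replace c = . in 1/3

* Approach 1: initialize to zero, so a missing c silently becomes 0
gen x = 0
replace x = 1 if c==1 | c==2

* Approach 2: initialize to missing, so a missing c stays missing
gen y = .
replace y = 1 if c==1 | c==2
replace y = 0 if c==3

* One-line equivalent of Approach 2; the if qualifier is needed
* because inlist() returns 0, not missing, when c is missing
gen z = inlist(c, 1, 2) if !missing(c)
assert y == z

* Include missing values in the tables to see how each approach treats them
tabulate c x, missing
tabulate c y, missing

* Number of observations where c is unknown but x was given a value
count if missing(c) & !missing(x)
```

In this toy data the final count is 3, which is exactly the set of observations that Approach 1 has quietly imputed.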

A note about missing values in Stata

Stata treats missing values as larger than any nonmissing number. This means that if you used the command

replace y=0 if c>2

Stata would set y=0 for observations with c=3 as well as observations with c=. (missing). In this case, you could just use the command in Approach 2 above (replace y=0 if c==3). For continuous variables, however, listing every value explicitly is not an option. The best workaround, in my opinion, is to use the following:

replace y=0 if c>2 & !missing(c)

This would set y=0 for all observations where c is not missing and the value of c is greater than two.
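Here is a rough sketch of the same guard applied to a continuous variable. The variable name income, the cutoff of 50000 and the four data values are all hypothetical, chosen only for illustration.

```stata
* Missing compares as larger than any nonmissing number
display . > 1000

* Hypothetical continuous variable with one missing observation
clear
input income
20000
75000
50000
.
end

* Unguarded: the missing observation is wrongly flagged as high
gen byte high_bad = income > 50000

* Guarded: the missing observation stays missing
gen byte high = (income > 50000) if !missing(income)

* An equivalent guard that relies on missing sorting above all numbers
gen byte high2 = (income > 50000) if income < .
assert high == high2

list
```

The display line prints 1, and in the listing high_bad is 1 for the observation with missing income while high and high2 are missing there.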

