# Imputing missing values, and simulating the dependent variable in a regression

#### eyesack_kn

##### New Member
Hi all,

I've been working on filling in some missing income values in a data set using multiple imputation. To check the validity of my model, I randomly selected a few hundred people whose income I knew and set their incomes to missing. When I compared their multiply imputed incomes to their true incomes, I found that low incomes were consistently over-estimated, while high incomes were consistently under-estimated; in other words, the true values of income were positively correlated with the difference between the actual and imputed values. This is despite the fact that I can't reject the hypothesis that the means and variances of the imputed values equal those of the true values. Sure, my MI model could be wrong, or the missingness could be correlated with unobservables, but just to be sure, I ran the following simulation:

Let the data-generating process for y be given by $$y_i := a + x_ib + e_i, \quad e_i \sim N(0,\sigma^2)$$
Then obtain $$\hat{y}_i=\hat{a}+\hat{b}x_i$$
Now simulate another error term, $$e_i^{sim}\sim N(0,\sigma^2)$$
and define $$y_i^{sim}=\hat {y}_i+e_i^{sim}$$
Finally, let $$\tilde{y}_i = y_i - y_i^{sim} \iff \tilde{y}_i = e_i-e_i^{sim}$$

Here, for the sake of simplification, I assume $$b=\hat{b} \mbox{ and } a=\hat{a}$$. (A simpler way to do this is just to simulate two sets of y values using two different error terms, but the above corresponds roughly to an imputation procedure.)
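For anyone who wants to reproduce this, here is a minimal sketch of the simulation in Python with NumPy. The parameter values (`a`, `b`, `sigma`), the sample size, the normally distributed `x`, and the function name `simulate` are all illustrative assumptions, not the values from my actual data.

```python
import numpy as np

def simulate(n=10_000, a=1.0, b=2.0, sigma=1.0, seed=0):
    """Simulate y, then 'impute' it as y_hat plus a fresh error draw.

    All defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                    # assumed regressor distribution
    e = rng.normal(scale=sigma, size=n)       # true error term
    y = a + b * x + e                         # data-generating process
    y_hat = a + b * x                         # assumes a_hat = a and b_hat = b
    e_sim = rng.normal(scale=sigma, size=n)   # independent simulated error
    y_sim = y_hat + e_sim                     # simulated ("imputed") y
    y_tilde = y - y_sim                       # equals e - e_sim
    return x, e, y, y_sim, y_tilde

x, e, y, y_sim, y_tilde = simulate()

# Correlation of the deviation with the true error and with the true y
corr_with_error = np.corrcoef(y_tilde, e)[0, 1]
corr_with_y = np.corrcoef(y_tilde, y)[0, 1]

# Mean deviation in the bottom and top quartiles of the true y
low = y < np.quantile(y, 0.25)
high = y > np.quantile(y, 0.75)
mean_dev_low = y_tilde[low].mean()
mean_dev_high = y_tilde[high].mean()

print(f"corr(y_tilde, e) = {corr_with_error:.3f}")
print(f"corr(y_tilde, y) = {corr_with_y:.3f}")
print(f"mean deviation, low y:  {mean_dev_low:.3f}")
print(f"mean deviation, high y: {mean_dev_high:.3f}")
```

A negative mean deviation in the bottom quartile means the simulated values are too high there, and a positive one in the top quartile means they are too low, even though the simulated and true y have the same mean and variance overall.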

But now the deviation of the simulated y from the true y, $$\tilde{y}_i$$, is positively correlated with the error term $$e_i$$, and hence with y itself. As a result, I observe that simulated values of y are too high for low values of y and are too low for high values of y, but are spot on for values of y near the mean. I imagine this must be a well-understood phenomenon in the literature on simulation, but can anyone point me to some relevant, edifying sources?
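As far as I can tell, the correlation follows directly from the definitions, under the assumption that the simulated error is drawn independently of both the true error and x:

$$\mathrm{Cov}(\tilde{y}_i, e_i) = \mathrm{Cov}(e_i - e_i^{sim}, e_i) = \mathrm{Var}(e_i) - \mathrm{Cov}(e_i^{sim}, e_i) = \sigma^2 > 0$$

$$\mathrm{Cov}(\tilde{y}_i, y_i) = \mathrm{Cov}(e_i - e_i^{sim}, a + x_ib + e_i) = \sigma^2$$

$$\mathrm{Var}(\tilde{y}_i) = \mathrm{Var}(e_i) + \mathrm{Var}(e_i^{sim}) = 2\sigma^2 \implies \mathrm{Corr}(\tilde{y}_i, e_i) = \frac{\sigma^2}{\sqrt{2\sigma^2}\,\sigma} = \frac{1}{\sqrt{2}}$$

So the deviation is positively correlated with both the true error and the true y regardless of the value of $$\sigma^2$$, as long as it is positive.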

Thanks!