Correlation and multiple imputation


New Member
Dear all,

I have been attempting to use multiple imputation (MI) to handle missing data in my study. I use the mice package in R for this. The deeper I get into this process, the more I realize I first need to understand some basic concepts which I hope you can help me with.

For example, let us consider two arbitrary variables in my study that have the following missingness pattern:

Variable 1 available, Variable 2 available: 51 (of 118 observations, 43.2%)
Variable 1 available, Variable 2 missing: 37 (31.4%)
Variable 1 missing, Variable 2 available: 10 (8.5%)
Variable 1 missing, Variable 2 missing: 20 (16.9%)
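
As a quick arithmetic check on the pattern above, here is a minimal sketch in Python (used only for illustration; the analysis itself is done in R). The dictionary keys are just labels chosen for this example.

```python
# Counts of each missingness pattern, copied from the list above.
patterns = {
    "both available": 51,
    "only Variable 1 available": 37,
    "only Variable 2 available": 10,
    "both missing": 20,
}

total = sum(patterns.values())  # number of observations
shares = {name: 100 * count / total for name, count in patterns.items()}

for name, share in shares.items():
    print(f"{name}: {patterns[name]} of {total} ({share:.1f}%)")

# Only the "both available" rows can enter a complete-case correlation.
complete_case_share = shares["both available"]
```

So fewer than half of the observations contribute to the complete-case correlation.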

I am interested in the correlation between Variable 1 and Variable 2.

Q1. Does it even make sense for me to use MI (or anything else, really) to replace my missing data when such large fractions are not available?

Plot 1 provides a scatter plot of these example variables in the original data. The correlation coefficient r = -0.34 and p = 0.016.

Q2. I notice that correlations between variables in the imputed data (pooled estimates over all imputations) are much weaker and less significant than the correlations in the original data. For this example, the pooled estimates for the imputed data are r = -0.11 and p = 0.22.

Since this seems to happen in all the variable combinations that I have looked at, I would like to know if MI is known to have this behavior, or whether this is specific to my imputation.
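
For reference, one common way to pool a correlation over m imputed data sets is to Fisher z-transform each estimate, combine the values on the z scale with Rubin's rules, and back-transform. Whether the pooled r above was computed this way is an assumption. The sketch below is in Python for illustration only; `pool_correlations` is a hypothetical helper, not a mice function, and the per-imputation values are made up, not taken from my data.

```python
import math


def pool_correlations(rs, n):
    """Pool correlations from m imputed data sets (Fisher z + Rubin's rules)."""
    m = len(rs)
    zs = [math.atanh(r) for r in rs]        # Fisher z-transform of each r
    z_bar = sum(zs) / m                     # pooled estimate on the z scale
    within = 1.0 / (n - 3)                  # sampling variance of z, same in every data set
    between = sum((z - z_bar) ** 2 for z in zs) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between  # Rubin's total variance
    return {
        "r": math.tanh(z_bar),              # back-transformed pooled correlation
        "se_z": math.sqrt(total),           # standard error on the z scale
        "within": within,
        "between": between,
        "lambda": (1 + 1 / m) * between / total,  # share of variance due to missingness
    }


# Hypothetical per-imputation correlations, NOT values from the study.
example_rs = [-0.25, -0.02, -0.18, 0.05, -0.15]
pooled = pool_correlations(example_rs, n=118)
print(pooled)
```

When the correlations differ a lot between imputations, the between-imputation variance grows, the total variance grows with it, and the pooled test becomes less significant.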

Q3. When going through the imputations, the distributions of the individual variables (min, max, mean, etc.) match the original data. However, correlations and least-squares line fits vary quite a bit from imputation to imputation (see Plot 2). Is this normal?

Q4. Since my results differ (quite significantly) between the original and imputed data, which one should I trust?

Thank you for your help in advance.


TS Contributor
Hi Tina,

First, let me state that multiple imputation is a valid procedure even when a very large proportion of observations is missing. I've seen applications where over 60% or 70% of the data points were missing.

On the other hand, the results of MI depend heavily on the model you use to impute. As you probably know, a good multiple imputation needs other variables that help predict the missing observations. It is important to include any information that may be useful, including design factors (strata, clusters, IDs in panel data, etc.). If you don't have a good model, the imputation procedure may not be appropriate. Regarding the discrepancy in your results, one sound piece of advice is to report both analyses, with and without imputation, and discuss the differences. By the way, if you end up with too few observations for a complete-case analysis, try non-parametric correlations (such as Spearman's rank correlation) and compare those between the two analyses.
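
To illustrate how much the imputation model matters, here is a minimal simulation sketch in Python (illustration only, not your data and not what mice does internally). It assumes values of y are missing completely at random, x is fully observed, and it uses a single stochastic imputation rather than full MI for brevity. The function names and parameter values are made up for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, rho=-0.5, p_missing=0.4, seed=1):
    """Compare two imputation models for y on simulated data (assumed setup)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    is_missing = [rng.random() < p_missing for _ in range(n)]  # MCAR in y

    obs = [(xi, yi) for xi, yi, m in zip(x, y, is_missing) if not m]
    obs_x = [p[0] for p in obs]
    obs_y = [p[1] for p in obs]

    # Least-squares fit of y on x using the observed cases only.
    mx, my = sum(obs_x) / len(obs), sum(obs_y) / len(obs)
    sxx = sum((xi - mx) ** 2 for xi in obs_x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in obs) / sxx
    intercept = my - slope * mx
    resid_sd = math.sqrt(
        sum((yi - intercept - slope * xi) ** 2 for xi, yi in obs) / (len(obs) - 2)
    )

    # Model A: impute y from its observed values, ignoring x entirely.
    y_ignoring_x = [rng.choice(obs_y) if m else yi for yi, m in zip(y, is_missing)]
    # Model B: impute y from the regression on x plus residual noise.
    y_using_x = [
        intercept + slope * xi + rng.gauss(0, resid_sd) if m else yi
        for xi, yi, m in zip(x, y, is_missing)
    ]
    return {
        "complete_case": pearson(obs_x, obs_y),
        "ignoring_x": pearson(x, y_ignoring_x),
        "using_x": pearson(x, y_using_x),
    }


result = simulate()
print(result)
```

In this setup the imputation that ignores x pulls the correlation toward zero, because the imputed values are unrelated to x, while the imputation that uses x stays close to the complete-case estimate. If your imputation model leaves out the variables whose correlation you later analyze, or predicts them poorly, a similar attenuation can occur.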

Hope this helps