I have been attempting to use multiple imputation (MI) to handle missing data in my study. I use the mice package in R for this. The deeper I get into this process, the more I realize I first need to understand some basic concepts which I hope you can help me with.

For example, let us consider two arbitrary variables in my study that have the following missingness pattern:

Variable 1 available, Variable 2 available: 51 (of 118 observations, 43.2%)

Variable 1 available, Variable 2 missing: 37 (31.4%)

Variable 1 missing, Variable 2 available: 10 (8.5%)

Variable 1 missing, Variable 2 missing: 20 (16.9%)
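To make the pattern easy to check, here is a minimal sketch that recomputes the shares from the four counts. It is written in Python rather than R purely for illustration, and the names `pattern_counts` and `pattern_percent` are my own, not part of mice.

```python
# Counts of the four missingness patterns for the two example variables
# (taken from the list above).
pattern_counts = {
    ("available", "available"): 51,
    ("available", "missing"): 37,
    ("missing", "available"): 10,
    ("missing", "missing"): 20,
}

n_total = sum(pattern_counts.values())  # total number of observations

# Share of each pattern, in percent of all observations.
pattern_percent = {
    pattern: 100 * count / n_total for pattern, count in pattern_counts.items()
}

for (v1, v2), pct in pattern_percent.items():
    print(f"Variable 1 {v1}, Variable 2 {v2}: {pct:.1f}%")

# Only rows with both variables observed can enter a correlation
# computed on the original (unimputed) data.
n_complete_pairs = pattern_counts[("available", "available")]
print(f"Complete pairs: {n_complete_pairs} of {n_total}")
```

So a correlation on the original data rests on fewer than half of the observations, while every imputed data set contains all of them.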

I am interested in the correlation between Variable 1 and Variable 2.

Q1. Does it even make sense for me to use MI (or anything else, really) to replace my missing data when such large fractions are not available?

Plot 1 provides a scatter plot of these example variables in the original data. The correlation coefficient r = -0.34 and p = 0.016.
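As a sanity check on these numbers, the test statistic for a Pearson correlation can be recomputed from r and the number of pairs. The sketch below assumes that the correlation was computed on the 51 rows where both variables are available (that is my assumption about how the original-data correlation was obtained), and it uses the rounded r, so the p-value should land close to, but not exactly on, the value above. The helper names `t_density` and `correlation_test` are made up for this illustration.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def correlation_test(r, n, steps=2000):
    """Return (t, two-sided p) for H0: rho = 0, given Pearson r and n pairs."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule (steps must be even) for the density mass on [0, |t|].
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_mass = total * h / 3
    # Two-sided p-value: everything outside [-|t|, |t|].
    return t, 1 - 2 * central_mass


# r is the rounded value quoted above; n = 51 is an ASSUMPTION
# (the number of rows with both variables observed).
t_stat, p_value = correlation_test(r=-0.34, n=51)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```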

Q2. I notice that correlations between variables in the imputed data (pooled estimates over all imputations) are much weaker, that is, closer to zero, and less significant than the correlations in the original data. For this example, the pooled estimates for the imputed data show r = -0.11 and p = 0.22.

Since this seems to happen in all the variable combinations that I have looked at, I would like to know whether MI is known to behave this way, or whether this is specific to my imputation.
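For context, my understanding is that a correlation is usually pooled by transforming each per-imputation r to Fisher's z, applying Rubin's rules on that scale, and transforming back. The sketch below illustrates that idea only; it is not the code I ran, I have not verified that it matches what my pooling step does internally, and the function name `pool_correlations` and the input values are hypothetical placeholders, not results from my study.

```python
import math
import statistics


def pool_correlations(rs, n):
    """Pool per-imputation Pearson correlations with Rubin's rules.

    rs: one correlation per imputed data set
    n:  number of rows in each imputed data set
    Returns (pooled_r, total_variance_on_z_scale, between_share).
    """
    m = len(rs)
    zs = [math.atanh(r) for r in rs]        # Fisher z-transform
    z_bar = statistics.fmean(zs)            # pooled estimate on the z scale
    within = 1.0 / (n - 3)                  # sampling variance of z
    between = statistics.variance(zs) if m > 1 else 0.0
    total = within + (1 + 1 / m) * between  # Rubin's total variance
    between_share = (1 + 1 / m) * between / total
    return math.tanh(z_bar), total, between_share


# HYPOTHETICAL per-imputation correlations, chosen only to show the
# mechanics; they are NOT taken from my data.
example_rs = [-0.05, -0.20, -0.02, -0.15, -0.12]
pooled_r, total_var, between_share = pool_correlations(example_rs, n=118)
print(f"pooled r = {pooled_r:.3f}")
print(f"total variance (z scale) = {total_var:.4f}")
print(f"share due to between-imputation spread = {between_share:.2f}")
```

The point of the sketch is that the pooled variance has two parts: the usual sampling variance and an extra term that grows with the spread of the estimates across imputations, so widely varying imputations lead to a less significant pooled result.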

Q3. When going through the imputations, the distribution of the individual variables (min, max, mean, etc.) matches the original data. However, correlations and least-squares line fits vary quite a bit from imputation to imputation (see Plot 2). Is this normal?

Q4. Since my results differ substantially between the original and imputed data, which one should I trust?

Thank you for your help in advance.

Tina