Multiple Imputation and Subgroup Analyses

#1
Hi all,

I have a dataset with missing data, and I want to do analyses on the whole dataset, as well as specific subgroups.
I am now wondering whether I can use the complete dataset for the multiple imputation and then perform all analyses, including the subgroup analyses, on this multiply imputed dataset, or whether I need to impute separately for each subgroup (i.e., first delete all observations that do not belong to the subgroup and then impute).
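To make the two options concrete, here is a rough sketch of what I mean (the column names are made up, and statsmodels' MICE is only a stand-in for whatever imputation routine is actually used):

```python
# Rough sketch of the two options (column names are made up; statsmodels'
# MICEData stands in for whatever MI routine is actually used).
import pandas as pd
from statsmodels.imputation import mice

df = pd.read_csv("mydata.csv")            # numeric columns; subgroup coded 0/1 and fully observed
cols = ["exposure", "outcome", "age", "subgroup"]

# Option A: impute once on the full dataset, then subset the completed data
imp_full = mice.MICEData(df[cols])
imp_full.update_all(10)                   # 10 cycles of chained equations
completed_full = imp_full.data            # one completed dataset
sub_a = completed_full[completed_full["subgroup"] == 1]

# Option B: subset first, then impute within each subgroup separately
parts = []
for g, part in df.groupby("subgroup"):
    imp_g = mice.MICEData(part[cols].drop(columns="subgroup"))  # constant within group
    imp_g.update_all(10)
    parts.append(imp_g.data.assign(subgroup=g))
completed_by_group = pd.concat(parts)

# (For real MI you would repeat either option to get several completed
# datasets, analyse each, and pool the results.)
```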

Thank you and best regards,
Patrick
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
More data is usually better. Your description was a little terse, but I would say you are fine imputing based on the full dataset, unless I am missing some way this would introduce systematic error.
 

noetsi

Fortran must die
#3
I would guess you would impute into the whole data set, although I have never seen this issue raised. The less data you have, the harder I think the imputations would be to do.
 
#4
Thanks for your replies!

The reason I raised this question is that I expect the association between the exposure and the outcome I am studying to differ across subgroups. So there is an implicit exposure-by-subgroup interaction. I think it would be important to model this interaction in the imputation model, but when I do so, the model does not converge.
When I leave the interaction out, I can impute easily, but I am not sure whether it would then be appropriate to perform subgroup analyses on that imputed dataset. So I thought it might be best to impute separately for each subgroup?
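To show what I mean by putting the interaction into the imputation model, here is a rough sketch using statsmodels' MICE (the variable names are made up, and this is not the exact software or code I am running):

```python
# Sketch only: the exposure-by-subgroup interaction is added to the
# conditional imputation model for the outcome (subgroup assumed coded 0/1).
import pandas as pd
from statsmodels.imputation import mice

df = pd.read_csv("mydata.csv")
imp = mice.MICEData(df[["exposure", "subgroup", "outcome", "age"]])

# Conditional model for the outcome: main effects plus the interaction term
imp.set_imputer("outcome", formula="exposure * subgroup + age")

imp.update_all(20)      # run the chained equations
completed = imp.data    # one completed dataset (repeat for multiple imputations)
```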

Best,
Patrick
 

noetsi

Fortran must die
#5
I am not an expert on this issue, but I would look at the literature on why models fail to converge and see whether you can address that first. I personally have not experienced this much, except with multilevel models.
 

hlsmith

Less is more. Stay pure. Stay poor.
#7
Well, that issue makes more sense. Yeah, I agree that trying to get the model to run with the interaction would be the first step. Perhaps play around with using fewer variables, etc., to make sure the code is written correctly and the model can run.

I would imagine that if you have heterogeneity in the treatment effects, you could run the imputations separately within each subgroup. Perhaps also set the seeds to the same values in all of the models. Given that you are imputing the outcome, you could also fit the desired outcome model, score that variable, and see how far the imputed values are off from the scored values. Not sure this is needed, but I am typically a fan of building in quality checks.
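Something along these lines is what I am picturing for that check (Python/statsmodels just for illustration; the variable names and the OLS outcome model are assumptions on my part):

```python
# Sketch of the quality check: impute, fit the outcome model on the rows where
# the outcome was actually observed, then compare the model's predictions
# ("scored" values) for the imputed rows against the imputed outcome values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

np.random.seed(12345)                         # reuse the same seed across subgroup runs
df = pd.read_csv("mydata.csv")
cols = ["exposure", "subgroup", "outcome", "age"]
missing_y = df["outcome"].isna()

imp = mice.MICEData(df[cols])
imp.update_all(10)
completed = imp.data                          # one completed dataset

# Outcome model fit on the rows where the outcome was actually observed
obs = df.loc[~missing_y, cols].dropna()
fit = sm.OLS(obs["outcome"],
             sm.add_constant(obs[["exposure", "subgroup", "age"]])).fit()

# Score the rows whose outcome was imputed and compare with the imputed values
X_imp = sm.add_constant(completed.loc[missing_y.values, ["exposure", "subgroup", "age"]])
scored = fit.predict(X_imp)
print((scored - completed.loc[missing_y.values, "outcome"]).abs().mean())
```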