Missing Data Imputation - How to choose the best method?


I found a toolbox with several methods for missing data imputation. Its authors use a Mean Squared Prediction Error (MSPE) criterion to determine which method is best for a given percentage of missing data.

They start from a complete data set (no missing data), artificially remove some of the values, impute those values, and calculate the MSPE between the imputed and the original values.
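The benchmarking procedure described above can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the toolbox's own code: the synthetic data, the function names (`make_missing`, `impute_mean`, `impute_pca`, `mspe`), the number of components, and the simple iterative PCA scheme are all assumptions chosen for the example.

```python
import numpy as np

def make_missing(shape, frac, rng):
    """Boolean mask marking a random fraction of entries as missing."""
    return rng.random(shape) < frac

def impute_mean(X_obs):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_obs, axis=0)
    return np.where(np.isnan(X_obs), col_means, X_obs)

def impute_pca(X_obs, n_components=2, n_iter=50):
    """Simple iterative PCA imputation (an illustrative stand-in method).

    Start from mean imputation, then alternate between fitting a low-rank
    PCA model and refilling only the missing cells with its reconstruction.
    """
    missing = np.isnan(X_obs)
    X_fill = impute_mean(X_obs)
    for _ in range(n_iter):
        mu = X_fill.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_fill - mu, full_matrices=False)
        X_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        X_fill = np.where(missing, X_hat, X_obs)
    return X_fill

def mspe(X_true, X_imp, mask):
    """Mean squared prediction error over the artificially removed cells."""
    return float(np.mean((X_true[mask] - X_imp[mask]) ** 2))

# Synthetic complete data set (assumption: roughly rank 2 plus small noise).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X_true = scores @ loadings + 0.1 * rng.normal(size=(200, 6))

# Create missing data: remove about 10% of the entries at random.
mask = make_missing(X_true.shape, 0.10, rng)
X_obs = X_true.copy()
X_obs[mask] = np.nan

# Impute with each candidate method and compare MSPE on the removed cells.
results = {
    "mean": mspe(X_true, impute_mean(X_obs), mask),
    "pca": mspe(X_true, impute_pca(X_obs), mask),
}
for name, score in results.items():
    print(f"{name}: MSPE = {score:.4f}")
```

The method with the lowest MSPE at a given missing percentage would be the preferred one under this criterion; repeating the loop over several random masks gives a more stable comparison.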

My problem is that my original data (industrial data) already has missing values, so I can't compute the MSPE between the complete data and the values imputed by fitting a PCA model.

I would like some help understanding how I can choose the best method.

The toolbox is from Arteaga and Ferrer: "PCA model building with missing data: new proposals and a comparative study" and "Missing data imputation: Toolbox for Matlab".

Thanks for your time


Omega Contributor
Yes, the choice of imputation method is typically based on your domain knowledge of the subject and of the mechanism behind the missingness. Little's test can check whether data are missing completely at random, but almost everything comes back to your own judgment of whether the data are MCAR, MAR, or NMAR. Regardless of the approach you take, most agree that multiple imputation is a preferred method because it accounts for your uncertainty about the imputed values. Lastly, the choice also depends on the type of data (e.g., continuous or categorical). I am not very familiar with whether certain methods perform better depending on the amount of missingness. But it is important to understand the mechanism and whether you have enough observed data to support imputation under that mechanism.
Multiple imputation (MI) is generally a more accurate way to fill in missing data. Missingness is an everyday occurrence, and a robust algorithm such as EMB (expectation-maximization with bootstrapping) can be used through good software such as R or Stata. To add to the previous reply, it also depends on the data you are using.
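The point made in the replies, that multiple imputation carries the uncertainty about imputed values into the final result, can be illustrated with Rubin's rules for pooling an estimate across imputed data sets. The sketch below is a hedged illustration: the function name `pool_rubin` is hypothetical, and the numbers are made up for the example rather than taken from any real analysis.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed data sets (Rubin's rules).

    estimates: the estimate obtained from each imputed data set
    variances: the squared standard error of each of those estimates
    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()              # pooled point estimate
    within = u.mean()             # average within-imputation variance
    between = q.var(ddof=1)       # between-imputation variance
    total = within + (1 + 1 / m) * between
    return float(q_bar), float(total)

# Made-up estimates of a mean from m = 5 imputed data sets (illustration only).
est = [10.2, 9.8, 10.5, 10.1, 9.9]
var = [0.25, 0.25, 0.25, 0.25, 0.25]

pooled, total_var = pool_rubin(est, var)
print(f"pooled estimate = {pooled:.3f}, total variance = {total_var:.3f}")
```

The total variance exceeds the average within-imputation variance whenever the imputed data sets disagree, which is how the extra uncertainty from the missing values shows up in the reported standard error.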