Best way to impute NAs before PCA in R


I have a dataset with approximately 4000 rows and 150 columns. I want to predict the values of a single column (= target).

The data is on cities (demographic, social, economic, ... indicators). Many of these are highly correlated, so I want to do a PCA (Principal Component Analysis).

The problem is that ~40% of the values are missing.

My current approach is:
  • Remove target indicator and do PCA with mean/median imputation of missing values.
  • Select x principal components (PC).
  • Append target indicator to these PC.
  • Use the PCs as predictors for the target variable and try common regression techniques, e.g. k-NN, linear regression, random forest.
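The steps above can be sketched in R roughly as follows (the data frame `df`, the column name `target`, and the number of components `k` are placeholders, not names from the actual dataset):

```r
# Sketch of the current approach; df is the full data frame and
# "target" is the column to predict (both are placeholders).
X <- df[, setdiff(names(df), "target")]

# Column-wise mean imputation of missing values
X_imp <- as.data.frame(lapply(X, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

# PCA on the completed, scaled data
pca <- prcomp(X_imp, center = TRUE, scale. = TRUE)

# Keep the first k principal components and append the target
k <- 10
pcs <- as.data.frame(pca$x[, 1:k])
pcs$target <- df$target

# Fit a regression model on the components (lm as one example)
fit <- lm(target ~ ., data = pcs)
```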

With this approach, I'm getting quite good results. My metric is RMSE%, the root mean squared relative prediction error. I tried this for all columns in the dataset; the RMSE% is between 0.5% and 8%, depending on the column. These errors are for values I actually know, NOT imputed values.

So, here's my problem: I'm not sure how much my data is distorted by replacing the missing values with the column mean/median. Is there any other way of imputing the missing values with minimal effect on the PCA results?

Any help would be highly appreciated.
Multiple imputation with the mice or Amelia or ... package.
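A minimal mice run might look like this (the data frame `X` and the seed are placeholders; `m = 5` requests five completed datasets, which is also the package default):

```r
library(mice)

# Impute the predictor columns only (target removed beforehand);
# m = 5 produces five completed datasets.
imp <- mice(X, m = 5, seed = 123)

# Extract one completed dataset, e.g. the first
X_complete1 <- complete(imp, 1)
```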
Thank you for your answer. Let's assume we are going with mice.

The result would be multiple imputed datasets (let's say 5).

How (or to be more specific: when) would I consolidate those datasets/the respective results?

Should I do 5 PCAs with each of the datasets?
Then I would have 5 different models and therefore 5 values for my estimated RMSE% at the end. Is the average of these the most realistic value to report?

And how would I go about actually predicting my target column? Taking the average of all 5 predictions?
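One way the consolidation described above could be sketched, under the assumption that averaging the five prediction vectors is acceptable (note this is a simplification: Rubin's rules for pooling apply to model estimates, not directly to predictions; `X`, `y`, and `k` are placeholders):

```r
library(mice)

# X: predictor data frame with NAs; y: known target values;
# k: number of principal components to keep. All placeholders.
imp <- mice(X, m = 5, seed = 123)

preds <- sapply(1:5, function(i) {
  Xi    <- complete(imp, i)                          # i-th completed dataset
  pca_i <- prcomp(Xi, center = TRUE, scale. = TRUE)  # one PCA per dataset
  d_i   <- as.data.frame(pca_i$x[, 1:k])
  d_i$target <- y
  fit_i <- lm(target ~ ., data = d_i)                # one model per dataset
  predict(fit_i)                                     # fitted values, for illustration
})

# Average the five prediction vectors (one simple way to consolidate)
pred_avg <- rowMeans(preds)
```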

Thank you so much.
The end goal is to use the principal components as predictors in a regression model (using methods like k-NN, or linear regression in R via lm()). And for that I want to impute the missing values before performing the PCA in a statistically "correct" manner.

So the final process would be:

.) Delete target column y from dataset (to avoid "using" the result in the model)
.) Impute missing values
.) Perform PCA on completed dataset
.) Use first x Principal Components as predictors to predict the values of column y


Phineas Packard
Ah, I did not see how large your dataset is. Your method in the OP seems sound, though a CART-based imputation method like the one available in the caret package (I think it is in caret) might be better.
CART based imputation
CART stands for "Classification and Regression Tree", right? So you would suggest using the train() function of the caret package with e.g. "rpart" as the method (overview of available tree models here).

Could you give me some insight into your thinking? Why are you recommending CART now instead of multiple imputation? Because of the size of the dataset?

And what's your opinion on column-wise median/mean imputation before performing a PCA, given that the results look good? Did you mean that this method seems sound?

Thank you.


Phineas Packard
Sorry, TS had an outage over the weekend. OK, so the problem with mean substitution etc. is that it assumes the missing data are missing completely at random, which seems unlikely. MI performs better, as its assumption is missing at random, and it even does OK when data are not missing at random. The difficulty is that with data of your size the missing-data model is going to be really slow, and then you have to deal with the multiple imputations in all subsequent models. Hence some form of tree-based approach.
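One concrete way to follow this advice without leaving mice: the package supports CART as the imputation model, and with a single imputation there are no multiple datasets to carry through the later PCA/regression steps (a sketch; `X` and the seed are placeholders):

```r
library(mice)

# Single imputation using classification-and-regression trees as the
# imputation model; m = 1 avoids handling multiple completed datasets
# in the downstream PCA and regression.
imp_cart <- mice(X, m = 1, method = "cart", seed = 123)
X_complete <- complete(imp_cart, 1)
```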
Alright I understand. Handling multiple imputations would indeed be cumbersome.

Any more hints for the tree-based imputation? I have between 100 and 200 columns; is it feasible and/or reasonable to use all of these columns in the model? Or should I look at correlations and only use the x most correlated columns for predicting the values in another column?
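If mice is used for the tree-based imputation, its quickpred() helper does roughly what is asked here: it builds a predictor matrix that, for each column being imputed, keeps only the columns whose correlation exceeds a threshold (a sketch; `X`, the seed, and the mincor value of 0.3 are placeholders):

```r
library(mice)

# Keep as predictors only columns correlated at |r| >= 0.3
# with the column being imputed
pred_mat <- quickpred(X, mincor = 0.3)

# CART-based single imputation restricted to those predictors
imp <- mice(X, m = 1, method = "cart",
            predictorMatrix = pred_mat, seed = 123)
```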

Thanks for your help.