Missing Data - Scale Validation

Hi everyone,

I am working to validate a 10-item scale with a 5-point Likert response format in a sample of 673. I'm trying to find a "happy medium" among everything I've been hearing about how to handle missing data. I am primarily using SPSS.

My adviser wants me to delete cases missing 3 or more items (2.5% of sample) and do mean substitution for the rest of the missing data. I've heard lots of conflicting info on what to do.

Six cases are missing all 10 items. Eleven cases are missing 3-9 items.

I ran the MCAR test, which indicated that the data are not missing completely at random, and the pattern is pretty obvious when you look at it: 52 cases are missing the same two items.

Missing data is also more frequent among people of certain ethnic groups, and I need to look at a couple of other variables to see if there are also differences there.
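For anyone who wants to tabulate this kind of thing outside SPSS, here is a minimal Python sketch of the counts described above: how many items each case is missing, and which combinations of items tend to be missing together. The `rows` data and the five-item width are made up for illustration; they are not from my dataset.

```python
from collections import Counter

# Hypothetical responses: one list per respondent, None marks a skipped item.
rows = [
    [4, 5, 3, None, None],
    [2, 2, 1, 3, 4],
    [None, None, None, None, None],
    [5, 4, 4, None, None],
    [3, None, None, None, 2],
]

# Number of missing items for each case.
missing_per_case = [sum(value is None for value in row) for row in rows]

# How many cases are missing 0, 1, 2, ... items.
count_distribution = Counter(missing_per_case)

# Which sets of item numbers (1-based) are missing together, and how often.
pattern_counts = Counter(
    tuple(i for i, value in enumerate(row, start=1) if value is None)
    for row in rows
)

for n_missing, n_cases in sorted(count_distribution.items()):
    print(f"{n_cases} case(s) missing {n_missing} item(s)")
for pattern, n_cases in pattern_counts.most_common():
    print(f"items {pattern}: {n_cases} case(s)")
```

A pattern shared by many cases (like the same two items being skipped) shows up as one tuple with a large count.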

Could someone offer me a quick step by step approach on what I should do to address the missing data? My first thoughts were to:
1. Drop cases missing 3 or more items, then
2. Use multiple imputation in SPSS which I have not formally learned but this paper is due soon! I don't know where to start with it.

I've been told that it's OK to do individual mean substitution for those respondents who are missing two items or less, but something else would have to be done with people who miss more. Is it OK to use a few different methods for dealing with missing data in the same set? Delete those that are completely missing, use mean sub for missing 2 or less, and use imputation for people who miss more? This seems too complicated though.

Any help is appreciated!!!


Super Moderator

Some thoughts:

1) I agree with your gut feeling that using several different methods to deal with missing data for the same dataset seems suboptimal. In fact, if you're going to use a decent missing value imputation method like multiple imputation for at least some of the cases, I'm not sure what benefit there would be in using mean imputation for some of the others.

2) Mean imputation is a very crude imputation technique; for one thing it reduces the variance of your variables. It is something you could perhaps argue for using if there was only a tiny quantity of missing data. I would not use it in this case.

3) I can understand the argument for dropping cases with LOTS of missing data, but do think about whether there is other information available in your wider dataset that could help you to make reasonable imputations for these values. I.e. if you have lots of other measured variables, you could still use multiple imputation; values for these people would be imputed primarily based on information found in variables other than the Likert scale items.

4) Multiple imputation is probably the best way of imputing missing data, and from what I understand SPSS does make producing and using MI data reasonably straightforward.

5) That said, spunky might argue that you are better off using latent variable modelling techniques that do not require imputation at all and instead allow coefficients to be estimated in the presence of missing data (e.g., FIML using lavaan in R). How much this appeals to you may depend on what kind of substantive analyses you were planning. E.g., are you mainly interested in doing SEM/CFA?
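To make points 2 and 4 a bit more concrete, here is a rough Python sketch (not SPSS output, and all numbers are invented). The first part shows how item-mean substitution shrinks the variance of an item. The second part shows the standard pooling step (Rubin's rules) that multiple imputation software applies after analysing each imputed dataset; the estimates and standard errors fed into it are hypothetical stand-ins for results from five imputed datasets.

```python
import math
import statistics

# Part 1: item-mean substitution shrinks variance.
# Hypothetical answers to one 5-point item; None marks a missing answer.
responses = [1, 2, None, 4, 5, None, 3, 5, 1, None]
observed = [r for r in responses if r is not None]
item_mean = statistics.mean(observed)
filled = [item_mean if r is None else r for r in responses]

var_observed = statistics.variance(observed)
var_filled = statistics.variance(filled)
print(f"variance, observed answers only: {var_observed:.3f}")
print(f"variance, after mean substitution: {var_filled:.3f}")


# Part 2: pooling results across imputed datasets with Rubin's rules.
def pool_rubin(estimates, variances):
    """Return the pooled estimate and its standard error."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average sampling variance
    between = statistics.variance(estimates)    # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Hypothetical scale means and standard errors from m = 5 imputed datasets.
estimates = [3.10, 3.25, 3.18, 3.05, 3.22]
std_errors = [0.050, 0.052, 0.049, 0.051, 0.050]
pooled, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
print(f"pooled estimate: {pooled:.3f}, pooled SE: {pooled_se:.3f}")
```

The imputed values sit exactly at the mean, so they add cases without adding spread, which is why the second variance is smaller. In the pooling step, the between-imputation term is what carries the uncertainty about the missing values into the final standard error; single mean substitution has no equivalent of it.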


Phineas Packard
I agree with CB. Your results are going to be more biased, not less, if you drop cases. I would strongly recommend multiple imputation or FIML over the approach your supervisor suggested. I would start by reading Craig Enders's book "Applied Missing Data Analysis" and Little and Rubin's book on missing data.


Thank you all for your feedback on this issue I've been having. I would like to try multiple imputation...mind if I ask one more question?

When you conduct MI, are all the variables in the dataset used? The dataset I'm using is an old one that I did not collect. It has a couple hundred variables that I will not be using in my analysis, so I haven't assessed them to see how much cleaning they need...I'm sure there is plenty to do. Must the entire dataset be used (minus string variables, ID, etc), or can it be limited to the variables I suspect might influence responses on the scale? Probably a silly question here but I'm wading through a lot of sources that seem to go into more detail than my brain can handle!

As for the planned analysis, I am doing a principal components analysis on a polychoric correlation matrix, which again goes against my supervisor's advice...

Thank you again for the feedback!


Super Moderator
I think the question of whether to use all available data for imputation is a tricky one. Using all of it should generally result in more accurate imputations. On the other hand, computation time might become a problem. There's also the issue that when you write the study up, you might not want to go into detail about all the extra data collected, yet if you use it for imputation, that extra data does have some bearing on your results.
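One middle ground that is sometimes suggested is to screen the extra variables and keep only those that are related to whether the scale items are missing (or to the items themselves). Here is a minimal Python sketch of that screening idea. The variable names, the data, and the 0.4 cutoff are all made up for illustration, and the cutoff in particular is an arbitrary assumption rather than a recommended value.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


# 1 = respondent skipped at least one scale item, 0 = complete (hypothetical).
missing_flag = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical candidate auxiliary variables from the wider dataset.
candidates = {
    "age": [23, 35, 61, 30, 58, 64, 28, 41, 55, 33],
    "survey_site": [1, 2, 1, 2, 2, 1, 1, 2, 2, 1],
}

THRESHOLD = 0.4  # arbitrary illustrative cutoff

# Keep variables whose correlation with missingness is at least the cutoff.
selected = [
    name
    for name, values in candidates.items()
    if abs(pearson(values, missing_flag)) >= THRESHOLD
]
print("candidate auxiliary variables:", selected)
```

In this toy data the older respondents are the ones with missing items, so `age` is kept, while `survey_site` is unrelated to missingness and is dropped. A shortlist like this would also be easier to describe in a write-up than a couple hundred unexamined variables.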

I also am unsure what to do in similar situations sometimes, so I'd be interested to hear what others here usually do!