Regression or regression alternatives for data missing not at random

I am analyzing the results of a satisfaction survey.

The research question is whether (dis-)satisfaction with performing certain activities on a system (measured on a 7-point satisfaction scale) predicts or explains overall satisfaction with the system (also measured on a 7-point satisfaction scale) better than (dis-)satisfaction with other activities.

While each participant did provide a rating for the outcome variable (satisfaction with the system), the tricky part about the analysis is that not every participant provided a satisfaction rating for each activity (predictor). They were only asked to rate their satisfaction with an activity if they also self-reported performing that activity. This is leading to a lot of missing data.

For example, imagine the predictor variables are X, Y, and Z. The survey structure was essentially:

Q1. Have you used system ABC to do X? [Yes / No]

Q2. Have you used system ABC to do Y? [Yes / No]

Q3. Have you used system ABC to do Z? [Yes / No]

The participant would only receive the satisfaction (7-point scale) question for activity "X", "Y" or "Z" if they selected "Yes" to the corresponding questions above.

There were about 6 activities. Of the approximately 1900 participants, only around 150 provided a satisfaction rating for every single activity, so with listwise deletion only about 7% of the sample remains. Overall, about 50% of the predictor values are missing. To my knowledge, this is such a large loss of data that techniques like multiple imputation are just not feasible, especially since I expect my missing data would be classified as "missing not at random".

Having said that, there are still a substantial number of data points for each predictor variable (no fewer than 500 for each predictor). It's only that it's rare for any one participant to provide ratings for all of the predictors.
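To make the gap between the two counts concrete, here is a minimal Python sketch on made-up rows (not my survey data). The dict layout, the `overall` key, and the function names are assumptions for illustration; `None` marks a rating question the participant was never shown.

```python
# Made-up rows: one dict per participant. None = question was never shown.
rows = [
    {"A": 6, "B": None, "C": 3, "overall": 5},
    {"A": None, "B": 4, "C": None, "overall": 4},
    {"A": 7, "B": 6, "C": 5, "overall": 6},
    {"A": 2, "B": None, "C": None, "overall": 3},
]
predictors = ["A", "B", "C"]


def complete_cases(rows, predictors):
    """Rows that survive listwise deletion (a rating for every predictor)."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]


def available_per_predictor(rows, predictors):
    """Number of non-missing ratings for each predictor taken on its own."""
    return {p: sum(r[p] is not None for r in rows) for p in predictors}


n_complete = len(complete_cases(rows, predictors))
available = available_per_predictor(rows, predictors)
print(f"complete cases: {n_complete} of {len(rows)}")
print(f"available per predictor: {available}")
```

Even in this toy example, each predictor keeps at least half of the rows on its own, while only one row has every rating.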

I feel that regression may simply be inappropriate and there may be no way to really resolve this problem if I want to include all or most predictors. However, if I'm mistaken I welcome any feedback.

What methods might be best suited for exploring how well satisfaction ratings with these activities predict overall satisfaction with the system, given that each record may not have data for several predictors? I've read loglinear analysis may be a possible approach -- but I'm not familiar with that analysis. Alternatively, I've considered just doing basic correlations -- but this doesn't really compare the activities against one another in a model, of course.

Thanks in advance.


Less is more. Stay pure. Stay poor.
Per your description it does not seem like this is a missing data problem at all. This is because the respondent didn't answer the parent question with a response that led to the follow-up question. So correct me if I am wrong, but that is not missing data. If someone asked me if I was female Y/N, then had a follow-up question for females - that isn't missing data for males. Is this your scenario?

Well you could proceed with subgroup analyses. So if the question was female y/n, age > 18 Y/N; then you could say females 18+ both had,....,
Thanks for your response. As the participants were not given the opportunity to answer the satisfaction question -- yes, I would say you're right and that strictly speaking this is not a "missing data" problem. Your Male / Female example summarizes the problem.

However, in the context of trying to develop a regression model, the fact that many of the values that would be used as predictors are unavailable still affects the feasibility of the method.

Your suggestion (which seems to me to basically involve segmenting the data into different sub-groups by activities performed and determining if the outcome variable, in my case overall satisfaction, differs between them) does seem to make sense -- however, I'd suggest it doesn't really get at whether or not satisfaction or dissatisfaction with certain activities has a more significant effect on the outcome variable. Are there any other methods you would suggest? Would simply determining the correlation coefficients between satisfaction with each activity and overall satisfaction (using only the cases where a rating is available) be a valid approach?
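To show what I mean by the correlation idea, here is a rough Python sketch that computes a Pearson correlation between each activity's rating and overall satisfaction, using only the participants who rated that activity. The rows are made up, and the `overall` key and function names are hypothetical. Note that each coefficient comes from a different subsample, so the values are not strictly comparable, and with 7-point ordinal scales a rank-based coefficient may be the better choice.

```python
from math import sqrt

# Made-up rows: None = the rating question was never shown.
rows = [
    {"A": 6, "B": None, "overall": 5},
    {"A": 2, "B": 4, "overall": 3},
    {"A": 7, "B": 6, "overall": 6},
    {"A": None, "B": 5, "overall": 4},
    {"A": 4, "B": 2, "overall": 4},
]
predictors = ["A", "B"]


def pearson(xs, ys):
    """Plain Pearson correlation for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(rows, predictors, outcome="overall"):
    """For each predictor, return (r, n) using only rows that have a rating."""
    result = {}
    for p in predictors:
        pairs = [(r[p], r[outcome]) for r in rows if r[p] is not None]
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        result[p] = (pearson(xs, ys), len(pairs))
    return result


correlations = pairwise_correlations(rows, predictors)
for name, (r, n) in correlations.items():
    print(f"{name}: r = {r:.3f} (n = {n})")
```

Reporting n next to each coefficient at least makes it visible how many participants each one is based on.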


Just to make sure we are following, can you present a small piece of your data (which can be made up if you want)? This way we can know what you are working with. Is it something like:

Q1, Q1_s, Q2, Q2_s, Q3, Q3_s, Q4, Q4_s,..., Y

Y, 6, N, ., N, ., Y, 3, ???? (how is Y formatted?)
Sure thing.

I've attached an Excel sheet that should answer this question.

Essentially, the participants were asked if they used a number of services (A,B,C,D,E,F). The formatting is binary (Yes or No).

If they indicated they used, for example, Service A, they received a satisfaction question for Service A (1-7 pt. scale). If they indicated not using a service, they did not receive a satisfaction question for it. These satisfaction scores are what I'd like to treat as the predictors.

Finally, regardless of the services they used / provided satisfaction ratings for, they were asked to rate their overall satisfaction with a company (1-7 pt. scale as well).
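In case the attachment is not handy, the layout can be sketched in Python roughly as follows. This is only an illustration with made-up values: the `Response` class, `make_response` helper, and field names are hypothetical and are not the column names in the Excel sheet.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SERVICES = ["A", "B", "C", "D", "E", "F"]


@dataclass
class Response:
    used: Dict[str, bool]                   # Yes/No usage question per service
    satisfaction: Dict[str, Optional[int]]  # 1-7 rating, None if never asked
    overall: int                            # 1-7 overall rating, always asked

    def is_consistent(self) -> bool:
        """A rating exists exactly when the service was reported as used."""
        for s in SERVICES:
            rating = self.satisfaction.get(s)
            if self.used[s] != (rating is not None):
                return False
            if rating is not None and not 1 <= rating <= 7:
                return False
        return 1 <= self.overall <= 7


def make_response(ratings: Dict[str, int], overall: int) -> Response:
    """Build a record from the ratings given; unrated services count as unused."""
    used = {s: s in ratings for s in SERVICES}
    satisfaction = {s: ratings.get(s) for s in SERVICES}
    return Response(used, satisfaction, overall)


# Made-up participant who used services A and D only.
example = make_response({"A": 6, "D": 3}, overall=5)
print(example.used)
print(example.satisfaction)
print(example.is_consistent())
```

The point of the consistency check is that an empty rating always means "did not use the service", never "skipped the question".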

I hope this helps, and thank you.