Multiple regression. Keep or remove variables?


New Member
Hi, everyone! Could you please help me resolve my confusion about multiple regression? o_O

Will try to keep it as short as I can.
During a survey (online questionnaire), data were collected on satisfaction with the service of a company (a bank). Respondents were asked to rate overall satisfaction (dependent variable, 6-point Likert scale, where 1 = N/A, 2 = Strongly dissatisfied, 6 = Very satisfied) and the performance of 10 attributes (independent variables, 5-point Likert scale, 1 = Strongly disagree, 5 = Strongly agree). Number of observations = 213.
It looked straightforward at the beginning: higher performance on the attributes should lead to higher satisfaction… But… I feel like something fishy is going on, and maybe I should not rely on the collected data. Can you please help me check the data/model fit? Below is info on what I did and how I reasoned; maybe some of my steps are wrong?.. Please… :oops:
Step 1. Run a multiple regression on the 10 attributes (SPSS).
R square = 0.373, quite low. The significance (p-value) for 6 of the 10 attributes is much higher than it should be.
Max Mahalanobis distance is 53.338, which is also unacceptable. (Meanwhile, Cronbach's alpha for the whole dataset = 0.868.)
For this regression, the 3 responses with satisfaction = N/A were excluded, but it didn't help… I was also surprised to see that, according to the coefficients, higher performance on the bank's accuracy (no mistakes during transactions) and better customer service have the opposite effect on overall satisfaction… Given their high p-values, those coefficients probably cannot be trusted.
And the residual plot looks weird as well.
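For anyone who wants to poke at similar numbers outside SPSS, here is a minimal sketch in Python of the Step 1 diagnostics (R square, residuals for a residual plot, max Mahalanobis distance). The data are a synthetic stand-in, not the survey from this thread, and the effect sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey: 213 respondents, 10 attribute
# ratings on a 1-5 Likert scale (hypothetical data, not the OP's).
n, k = 213, 10
X = rng.integers(1, 6, size=(n, k)).astype(float)
# Satisfaction on the 2-6 part of the 6-point scale (1 = N/A excluded).
y = np.clip(np.round(2 + 0.5 * X[:, 0] + 0.3 * X[:, 1]
                     + rng.normal(0, 1.2, n)), 2, 6)

# OLS via least squares, with an explicit intercept column.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
fitted = Xd @ beta
resid = y - fitted          # plot resid vs fitted for the residual plot

# R-squared
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Squared Mahalanobis distances of the predictors (SPSS reports the max).
Xc = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', Xc, cov_inv, Xc)
print(r2, d2.max())
```

With Likert predictors a residual plot always shows diagonal bands (one per response level), which can look "weird" without indicating a broken model.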

Step 2. Multiple regression on the 10 attributes with dummies.
The residual plot made me think that maybe the problem is in the scales, not in the data… So I made dummy variables for the performance of each attribute. The average level (performance = 3) was taken as the reference, so I coded Low performance (1–2) and High performance (4–5).

And… it didn't help. R square = 0.342, and almost everything is insignificant.
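The dummy coding described in Step 2 can be sketched like this (pure NumPy, with a single hypothetical attribute; in SPSS this is just a recode):

```python
import numpy as np

# Hypothetical 1-5 ratings for one attribute (e.g. "Speed").
speed = np.array([1, 2, 3, 3, 4, 5, 2, 4])

# Reference category: average performance (rating == 3), so two dummies:
low = (speed <= 2).astype(int)    # 1-2 -> "low performance"
high = (speed >= 4).astype(int)   # 4-5 -> "high performance"

dummies = np.column_stack([low, high])  # a rating of 3 gets (0, 0)
print(dummies)
```

Each attribute then contributes two columns to the regression, and its coefficients are read as contrasts against the average-performance reference group.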
Step 3. Remove attributes.
I split the dataset into 2 subsets based on geographical location (I hoped that it would help) and reran the regressions; it didn't help.
With a heavy heart I removed all the attributes that showed insignificance and ended up with this:
For the first dataset (Belgium) I ended up with 3 attributes: Speed, Friendliness and Regulation (R square = 0.412).
For the second dataset (Russia), only 2: Trust and Friendliness (R square = 0.389).

Is removing almost everything really the only way to go? :oops::confused:
What is also confusing is that the implicit importance of the attributes is quite high and doesn't correspond with the results of the regression.
Please guide me through the complex world of statistics. I desperately want to understand what's wrong…


Fortran must die
You should probably be running an ordered (ordinal) logistic regression (I am not sure what you are running, since "multiple regression" can mean many things). Your DV is not really interval, as linear regression requires, since it has only 6 levels. However, that is probably enough levels for linear regression to work in practice.

Based on your residuals, something is likely wrong. I think you are probably missing a variable from your model, but that is simply a guess. I know nothing of your research area.

You should go with what theory in this area says… if there is theory. Dropping variables simply because they are not significant is not recommended if you have theory that says they matter. You should report the results and simply say they did not matter statistically.


Ambassador to the humans
Based on your residuals, something is likely wrong. I think you are probably missing a variable from your model, but that is simply a guess. I know nothing of your research area.
I'm just curious why you think they're missing a variable based on the residuals?