Multicollinearity in regression

Hi everyone

I encountered a strange thing while performing a regression.

When I computed the variables for the regression, I discovered that some of them had low Cronbach's alpha reliability, so I decided to remove a few items from the questionnaires they were computed from. Still, I computed both versions of each variable. E.g., one version might have been computed from items 1-6 and the other from items 1, 2, 4, 6 (let's call these double v's). The two versions had Pearson correlations of about 0.85.

Now, none of my predictor variables had a significant correlation with the predicted variable. So I performed a backward regression with all of my variables, expecting at least one version of the double v's to be left out of the model. This resulted in a significant regression with a good R square (although not all predictors were significant). However, the model included both versions of the double v's, and for more than one variable. Unsurprisingly, the VIF levels were pretty high (although none were higher than 10), so I manually performed an enter regression with only one version of the double v's. This resulted in a non-significant regression with a very low R square, no matter which version of the double v's I included in the model. Admittedly, my sample is quite small: only 41 observations with around 10 predictors.

So, I have two questions:
1. How is this even possible?
2. Given all of the above, does it make sense to use the regression with the high R square?

Thank you for reading all of this!
Hi Shachar,

How did you compute the variables from items 1-6 and from items 1,2,4,6?
When you say insignificant with only one of the 2 variables, what was the p-value?
(The power of the regression with a medium effect size (0.15) is low (0.25).)
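
For anyone who wants to check a power figure like this, here is a rough Monte Carlo sketch in Python. It assumes alpha = 0.05, the 41 observations and roughly 10 predictors mentioned in the first post, Cohen's f^2 = 0.15, and the common convention lambda = f^2 * n for the noncentrality parameter. None of these were spelled out alongside the 0.25 figure, so treat them as assumptions. The critical value 2.165 is a tabulated F(10, 30) point, hard-coded to keep the sketch free of third-party libraries.

```python
import math
import random

# Assumptions (not stated in the thread): alpha = 0.05, f^2 = 0.15,
# n = 41 observations, 10 predictors, noncentrality lambda = f^2 * n.
N_OBS = 41
N_PRED = 10
F2 = 0.15
DF_NUM = N_PRED                # numerator df of the overall F test
DF_DEN = N_OBS - N_PRED - 1    # denominator df (30)
LAMBDA = F2 * N_OBS            # noncentrality parameter (6.15)
F_CRIT = 2.165                 # tabulated upper 5% point of F(10, 30)


def chi2(rng, df):
    """Draw a central chi-square variate as Gamma(shape=df/2, scale=2)."""
    return rng.gammavariate(df / 2.0, 2.0)


def simulate_power(n_sims=20000, seed=1):
    """Estimate the power of the overall F test by simulation."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Noncentral chi-square: one shifted normal squared plus a
        # central chi-square with the remaining degrees of freedom.
        shifted = (rng.gauss(0.0, 1.0) + math.sqrt(LAMBDA)) ** 2
        num = shifted + chi2(rng, DF_NUM - 1)
        den = chi2(rng, DF_DEN)
        f_stat = (num / DF_NUM) / (den / DF_DEN)
        if f_stat > F_CRIT:
            rejections += 1
    return rejections / n_sims


power = simulate_power()
print(f"estimated power of the overall F test: {power:.2f}")
```

Changing N_OBS shows how quickly the power improves with a larger sample, which is probably the more useful thing to play with here.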

Was the only difference between the significant model with the good R square and the insignificant one with the low R square the removal of one variable of the two? That is, were the rest of the variables the same?

PS, you should not be worried about the multicollinearity, since you "built" it.
Independently of that, multicollinearity may have a big effect on the coefficients but not on the model as a whole.
Hi obh. Thank you for answering.

The variables were computed simply by averaging the items.

I have several regressions. In one, I have 2 pairs of such variables. For one pair, the first variable had p=0.001 and the second had p<0.001. For the other pair, the first had p=0.021 and the second had p=0.013. R square was 0.65 and the entire model had p<0.001.

After removing the first variable of each pair, the second variable from the first pair had p=0.675 and the second variable from the second pair had p=0.945. R square was 0.29 and the entire model had p=0.231.

I'm glad to hear that I shouldn't have to worry about using the model, but I didn't understand why. What difference does it make if I "built" it?

Anyway, I am still very curious about my first question: how are these results possible? And another thing: following your questions, I tried removing a variable from only one pair and got a significant regression with a good R square. However, now some of the other variables are also not significant, and removing them causes the entire model to become non-significant again. So I'm a bit confused.
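
On the "how is this possible" question, here is one mechanism that can produce this pattern (a sketch, not a diagnosis of the actual data). The mean of items 1-6 and the mean of items 1,2,4,6 differ only through items 3 and 5, because mean(1-6) - mean(1,2,4,6) = [mean(3,5) - mean(1,2,4,6)] / 3. Entering both versions therefore lets the model use the contrast between the dropped and the kept items, which neither version carries on its own. The Python below simulates made-up data with 41 observations and applies the standard two-predictor formula for R square. The item model, the outcome and the noise levels are all assumptions chosen for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(0)
n_obs = 41
v6, v4, m35, y = [], [], [], []
for _ in range(n_obs):
    common = rng.gauss(0.0, 1.0)                     # shared trait (assumed)
    items = [common + rng.gauss(0.0, 1.0) for _ in range(6)]
    kept = [items[0], items[1], items[3], items[5]]  # items 1, 2, 4, 6
    dropped = [items[2], items[4]]                   # items 3, 5
    v6.append(sum(items) / 6)                        # version from items 1-6
    v4.append(sum(kept) / 4)                         # version from 1, 2, 4, 6
    m35.append(sum(dropped) / 2)
    # Hypothetical outcome: driven by the dropped-vs-kept contrast.
    y.append(m35[-1] - v4[-1] + rng.gauss(0.0, 0.3))

# The difference between the two versions is the contrast divided by 3.
max_identity_error = max(
    abs((a - b) - (c - b) / 3) for a, b, c in zip(v6, v4, m35)
)

r12 = pearson(v6, v4)      # correlation between the two versions
r_y6 = pearson(y, v6)
r_y4 = pearson(y, v4)

r2_v6 = r_y6 ** 2          # R square with only the 6-item version
r2_v4 = r_y4 ** 2          # R square with only the 4-item version
# R square with both versions entered (two-predictor formula).
r2_both = (r_y6 ** 2 + r_y4 ** 2 - 2 * r_y6 * r_y4 * r12) / (1 - r12 ** 2)

print(f"correlation between versions: {r12:.2f}")
print(f"R square, 6-item version only: {r2_v6:.2f}")
print(f"R square, 4-item version only: {r2_v4:.2f}")
print(f"R square, both versions: {r2_both:.2f}")
```

If something like this is going on, the high R square is really about items 3 and 5 behaving differently from the rest, which would be worth checking directly by entering the mean of the dropped items as its own predictor.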
Hi Shachar,

I don't fully understand what you did. Maybe show a simple example, or the results?

"What difference does it make if I "built" it?"
I will give you an example:

With multicollinearity, the problem is not predicting Y but knowing the coefficients.

Let's say that X1 and X2 are highly correlated. Then you don't know which of the following is correct:


So you can't say how a change in X1 will affect Y, but the prediction of Y will be correct.
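
As a hypothetical illustration of this point in Python (the data and the three coefficient pairs are made up), take an X2 that is almost identical to X1. Very different splits of the effect between X1 and X2 then give nearly the same predicted Y, so the data cannot tell them apart.

```python
# Made-up data: X2 is X1 plus a tiny perturbation, so the two are
# almost perfectly correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
perturbation = [0.01, -0.02, 0.015, -0.005, 0.0]
x2 = [a + e for a, e in zip(x1, perturbation)]

# Hypothetical candidate models (b1, b2): each one implies a different
# effect of X1 on Y (2, 0 and 1 respectively).
candidates = {
    "Y = 2*X1 + 0*X2": (2.0, 0.0),
    "Y = 0*X1 + 2*X2": (0.0, 2.0),
    "Y = 1*X1 + 1*X2": (1.0, 1.0),
}

predictions = {
    label: [b1 * a + b2 * b for a, b in zip(x1, x2)]
    for label, (b1, b2) in candidates.items()
}

# Largest disagreement between the three models at any observation.
max_spread = max(
    max(values) - min(values) for values in zip(*predictions.values())
)

for label, values in predictions.items():
    print(label, [round(v, 3) for v in values])
print("largest disagreement between models:", round(max_spread, 3))
```

The three models disagree completely about what X1 does, yet their predictions differ by no more than twice the largest perturbation.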

So in the following example:
You will have multicollinearity between X1 and X1^2, but this is not a problem because you can still say how a change in X1 or in X2 will affect Y.