Variance inflation and interactions between variables?

enur

New Member
#1
Hello everyone

I am having trouble interpreting some of my results.
I am using logistic regression to infer a model based on measured data. Some of my explanatory variables are continuous (e.g. temperature [°C]) and some are categorical (e.g. time of day [night, morning, day, afternoon, evening]). To investigate multicollinearity issues, I have calculated generalized variance inflation factors (GVIF) in R (using the car package). R automatically calculates GVIF^(1/(2*Df)), which to my understanding is an estimate of the factor by which the confidence interval of each coefficient is inflated (please correct me if I am wrong).
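
For concreteness, here is a minimal sketch of the calls involved (the data frame dat and its column names are placeholders, not my actual data):

Code:
library(car)
# hypothetical data frame 'dat' with a binary response 'event',
# a continuous 'Temperature' and the factor 'time'
m <- glm(event ~ Temperature * time, family = binomial, data = dat)
vif(m)  # reports GVIF, Df and GVIF^(1/(2*Df)) for each model term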

My problem is: How should I interpret the GVIF of interaction terms between continuous and categorical variables?

One of my simple models looks like this:

Code:
Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -8.76208    2.54158  -3.447 0.000566 ***
Temperature                0.08847    0.11441   0.773 0.439363    
timeMorning                3.20504    2.82524   1.134 0.256614    
timeDay                    0.72913    2.77043   0.263 0.792409    
timeAfternoon             -0.34141    2.77430  -0.123 0.902057    
timeEvening               -0.97397    3.16012  -0.308 0.757926    
Temperature:timeMorning   -0.02669    0.12782  -0.209 0.834601    
Temperature:timeDay        0.06239    0.12415   0.503 0.615302    
Temperature:timeAfternoon  0.09116    0.12386   0.736 0.461711    
Temperature:timeEvening    0.06535    0.13907   0.470 0.638410
with the following GVIFs

Code:
                        GVIF Df GVIF^(1/(2*Df))
Temperature      2.091206e+01  1        4.572971
time             1.595779e+08  4       10.601604
Temperature:time 1.899285e+08  4       10.834872
I would like to make a table like the one below:

Code:
                  Estimate     std.Dev   std.Err   C.I. 2.5%   C.I. 97.5%   Inflation
Intercept
        Night     -8.76208     2.54158   0.010494  -8.78       -8.74            XX
        Morning   -5.55704     3.800212  0.01569   -5.59       -5.53            XX
        Day       -8.03295     3.759642  0.015523  -8.06       -8.00            XX
        Afternoon -9.10349     3.762495  0.015535  -9.13       -9.07            XX
        Evening   -9.73605     4.055365  0.016744  -9.77       -9.70            XX
Temperature
        Night      0.08847     0.11441   0.000472   0.0875      0.0894          XX
        Morning    0.06178     0.171545  0.000708   0.0604      0.0632          XX
        Day        0.15086     0.168828  0.000697   0.1495      0.1522          XX
        Afternoon  0.17963     0.168615  0.000696   0.1783      0.1810          XX
        Evening    0.15382     0.180084  0.000744   0.1524      0.1553          XX
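
(For reference, the combined per-level estimates and standard errors in such a table can be computed as linear combinations of the coefficients above; a sketch for the Morning slope, assuming the fitted model is stored in m:)

Code:
# slope of Temperature at time = Morning:
# b[Temperature] + b[Temperature:timeMorning]
b <- coef(m)
V <- vcov(m)
i <- "Temperature"
j <- "Temperature:timeMorning"
est <- unname(b[i] + b[j])                    # 0.08847 - 0.02669 = 0.06178
se  <- sqrt(V[i, i] + V[j, j] + 2 * V[i, j])  # SE of the linear combination
c(estimate = est, std.error = se)
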
My problem is: How do I calculate the Inflation of the confidence intervals?

I would really appreciate it if anyone can help!
 

Jake

Cookie Scientist
#3
You should not think too hard about the VIFs in this scenario. If you do not center your predictors (and it looks like you haven't!), then there will almost always be apparently extreme multicollinearity between the interaction term and the simple effect terms. This makes sense: if you have an interaction term A*B, it should not be surprising that this is highly correlated with A, because half of what comprises A*B is A itself!

However, this multicollinearity is a red herring. It is an artifact of having not centered your predictors and does not actually inflate your confidence intervals to an undue degree.

To see this, take a look at the formula for the \(100(1 - \alpha)\%\) confidence interval of the coefficient \(\beta_j\) for a predictor \(X_j\):

\(b_j \pm \sqrt{\frac{(F_{1,n-p;\alpha})(MSE)}{(SSX_j)(TOL_j)}}\)

\(F_{1,n-p;\alpha}\) is the critical value of F, \(MSE\) is the mean squared error of the model, \(SSX_j\) measures the variability of the predictor \(X_j\) (technically it is the sum of squared deviations from the mean: \(SSX_j = s_j^2(n - 1)\)), and \(TOL_j\) is the "tolerance" of \(X_j\), which is just \(\frac{1}{VIF_j}\).

As you can see, as the tolerance decreases (conversely, as the VIF increases), the confidence interval expands. (This also answers your question about what exactly the inflation factor is -- the confidence interval expands with the square root of the VIF.) However, in the situation of predictors that are products of uncentered variables, it turns out that this decrease in tolerance caused by not centering the predictors is offset by an increase in the variability of the predictor, \(SSX_j\), so that these two effects cancel out and the width of the confidence interval is net unchanged.

The following tables might help to illustrate the effect of centering on both multicollinearity and variance:

Uncentered
Code:
> uncen
      x1 x2 x1x2
 [1,]  7  8   56
 [2,]  4  6   24
 [3,]  9  9   81
 [4,]  6  8   48
 [5,]  6  9   54
 [6,]  6  5   30
 [7,]  6  9   54
 [8,]  6  1    6
 [9,]  8  3   24
[10,]  5  9   45
> 
> # correlations
> cor(uncen)
               x1           x2      x1x2
x1    1.000000000 -0.002730559 0.4655388
x2   -0.002730559  1.000000000 0.8715734
x1x2  0.465538807  0.871573389 1.0000000
> 
> # variances
> apply(uncen, 2, var)
        x1         x2       x1x2 
  2.011111   8.233333 459.733333
Centered
Code:
> cen
        x1   x2  x1x2
 [1,]  0.7  1.3  0.91
 [2,] -2.3 -0.7  1.61
 [3,]  2.7  2.3  6.21
 [4,] -0.3  1.3 -0.39
 [5,] -0.3  2.3 -0.69
 [6,] -0.3 -1.7  0.51
 [7,] -0.3  2.3 -0.69
 [8,] -0.3 -5.7  1.71
 [9,]  1.7 -3.7 -6.29
[10,] -1.3  2.3 -2.99
> 
> # correlations
> cor(cen)
               x1           x2      x1x2
x1    1.000000000 -0.002730559 0.1632143
x2   -0.002730559  1.000000000 0.1961749
x1x2  0.163214305  0.196174937 1.0000000
> 
> # variances
> apply(cen, 2, var)
       x1        x2      x1x2 
 2.011111  8.233333 10.530667
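
A quick way to verify the cancellation directly is to fit a model on these same 10 observations with an arbitrary response (the equality below holds for any response vector): the row for the interaction term comes out identical whether or not the predictors are centered, despite the very different correlations and variances of the product column.

Code:
x1 <- c(7, 4, 9, 6, 6, 6, 6, 6, 8, 5)
x2 <- c(8, 6, 9, 8, 9, 5, 9, 1, 3, 9)
x1c <- x1 - mean(x1)   # centered copies
x2c <- x2 - mean(x2)

set.seed(1)            # arbitrary response, purely for illustration
y <- rnorm(10)

m_uncen <- lm(y ~ x1 * x2)
m_cen   <- lm(y ~ x1c * x2c)

# estimate and standard error of the interaction agree exactly
summary(m_uncen)$coefficients["x1:x2", ]
summary(m_cen)$coefficients["x1c:x2c", ]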
 

enur

New Member
#5
Thank you for your replies – I really appreciate it!

Hlsmith: yes, when I wrote categorical I meant ordinal - I often make this mistake (I don’t know why).

Jake: I am not really sure what you mean by centering of predictors. Maybe I was not clear enough about my data.
I have been measuring temperature (and some other variables) in residential buildings for a period of time. The variables (including temperature) were measured at 10-minute intervals. Based on the time of day, I have created an ordinal variable called time [night, morning, day, afternoon, evening].
I have also recorded different events (on/off) in the buildings. My aim is to create models which can predict events based on the measured variables. I have used logistic regression with stepwise forward and backward variable selection (based on AIC) to infer the different models. The model in my example is the simplest I could think of; most of the inferred models include more variables.

I would like to calculate the possible inflation of the confidence intervals due to collinearity, so I (and others) can be aware of this in the future, when I start using (and validating) the models.

If I understand it correctly, a GVIF^(1/(2*Df)) of 10.6 for the variable ‘time’ means that the effects of ‘time’ on the intercept may be inflated to such an extent that the inferred confidence intervals for the intercept may be up to 10.6 times too large. A GVIF^(1/(2*Df)) of 4.6 for the variable ‘Temperature’ means that the confidence interval for the ‘Temperature’ coefficient may be up to 4.6 times too large compared to the case with no multicollinearity. My problem is that I have interactions between ‘time’ and ‘Temperature’, resulting in five different coefficients for the variable ‘Temperature’. How do I interpret a GVIF^(1/(2*Df)) of 10.8 for the interaction between Temperature and time?
Can I simply add the GVIFs, so that the Temperature confidence intervals may be 4.6+10.8=15.4 times too large?

Any insights are highly appreciated!
 

Jake

Cookie Scientist
#6
"Jake: I am not really sure what you mean by centering of predictors. Maybe I was not clear enough about my data."
Yes, I think I understand the example. To "center" a predictor means to subtract off the mean value of that predictor from all the individual values, so that the new mean is 0. Observe the values of x1 and x2 in the first code block that I posted and compare them to the values of x1 and x2 in the second code block.
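
In R, centering is one line per predictor; a sketch with the hypothetical dat from your first post:

Code:
# subtract the mean so the centered variable has mean 0
dat$Temperature_c <- dat$Temperature - mean(dat$Temperature)
# refitting with the centered predictor should shrink the GVIF of the
# interaction dramatically, without changing the fit of the model
m_c <- glm(event ~ Temperature_c * time, family = binomial, data = dat)
car::vif(m_c)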
 
#7
At the risk of hijacking (asked here to avoid creating crap threads with one-liner questions):

If two regressors in a five-regressor cross-sectional regression have a Pearson correlation of 0.5 with p-value < 0.001, is this bad? I have about 350 observations in the cross-section, and the other Gauss-Markov assumptions are intact.
 

Jake

Cookie Scientist
#8
Probably not. What are the VIFs?

Note that even when multicollinearity is a big problem, it's really only a "problem" in the sense of having a negative influence on power. There is no "assumption" of non-collinearity to be violated (short of perfect collinearity, which makes the coefficients unestimable). It just works out more nicely to have the predictors be close to orthogonal.
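
A sketch of the check, assuming a data frame cs holding the response and the five regressors (all names here are placeholders):

Code:
library(car)
m <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = cs)
vif(m)  # common informal thresholds for concern are around 5 to 10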
 

noetsi

Fortran must die
#9
An old thread, but one I have a question on. If I understand Jake's comment...

"As you can see, as the tolerance decreases (conversely, as the VIF increases), the confidence interval expands. (This also answers your question about what exactly the inflation factor is -- the confidence interval expands with the square root of the VIF.) However, in the situation of predictors that are products of uncentered variables, it turns out that this decrease in tolerance caused by not centering the predictors is offset by an increase in the variability of the predictor, \(SSX_j\), so that these two effects cancel out and the width of the confidence interval is net unchanged."
correctly, then while the VIF will likely indicate multicollinearity for interaction terms [and possibly for the main effects involved in them], this multicollinearity will not affect the tests of statistical significance [through the standard errors] as it normally would. I assume this is because there really is no actual multicollinearity in this case; it is only the VIF that is distorted [although I am not certain of this from the post].

I assume Jake means main effects when he mentions simple effect terms in his post.