URGENT: Omitted Dummy Variable /// Multicollinearity


Hi guys,

I'm a complete newbie when it comes to statistically evaluating data for my thesis and would therefore very much appreciate your help.

I have two questions:

1) My regression equation includes a vector of industry dummy variables as independent variables (three columns: A) SIC 20-39 = 1, else 0; B) SIC 50-59 = 1, else 0; C) SIC 70-89 = 1, else 0). To avoid the dummy variable trap, I left the SIC 20-39 dummy out of the regression, since including all three would cause perfect multicollinearity.
=> My question therefore is: how can one estimate the coefficient and the t-statistic of the SIC 20-39 dummy if it is not included in the (OLS) regression?

2) When running the OLS regression (in SPSS) I also requested the collinearity statistics and saw that two variables have a VIF of 45-50, which is very high and possibly led to the very low adjusted R-squared.
=> My question therefore is: what can I do to lower the VIF without taking out any variables? (I am following exactly the procedure of a well-known paper, using my own data; in that paper the VIFs were very low with the exact same variables.)

I would be so grateful if you could help me with these two issues, as my thesis deadline is approaching and Google doesn't offer any good advice either :(.


For the first question: in the output of your software, you will have the beta coefficients for B and C (and not for A, since it is not included in the model). Coefficient B compares B to A, and coefficient C compares C to A. For instance, if coefficient B is equal to 1.5, it means that when an observation is in category B, your dependent variable is 1.5 higher than when it is in category A, holding the other variables constant. The intercept absorbs the level of category A itself.
To compare B to C, you can compute: coefficient B minus coefficient C.
The tests in the output are always comparisons with A.
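
To make the "everything is compared with A" point concrete, here is a minimal sketch in plain Python. The numbers are made up, and it assumes the dummies are the only regressors, in which case the OLS coefficients reduce to differences of group means. With additional covariates the coefficients become adjusted differences and you would read them from your software's output instead.

```python
# Hypothetical outcome values per industry group (A is the omitted reference).
groups = {
    "A": [11.0, 12.0, 13.0],   # e.g. SIC 20-39, dummy left out
    "B": [13.0, 14.0],         # e.g. SIC 50-59
    "C": [9.0, 10.0, 11.0],    # e.g. SIC 70-89
}

def mean(xs):
    return sum(xs) / len(xs)

# With an intercept plus the B and C dummies only, the OLS solution is:
intercept = mean(groups["A"])              # level of the reference group A
coef_b = mean(groups["B"]) - intercept     # B compared to A
coef_c = mean(groups["C"]) - intercept     # C compared to A

# Build design rows [1, dummy_B, dummy_C] and check the normal equations
# X'(y - Xb) = 0, which confirms that these values are the least-squares fit.
rows, y = [], []
for name, values in groups.items():
    for v in values:
        rows.append([1.0, float(name == "B"), float(name == "C")])
        y.append(v)

beta = [intercept, coef_b, coef_c]
residuals = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(rows, y)]
normal_eq = [sum(row[j] * r for row, r in zip(rows, residuals)) for j in range(3)]

b_vs_c = coef_b - coef_c   # comparison of B with C
print(intercept, coef_b, coef_c, b_vs_c)
```

So the "missing" A is not lost: its level sits in the intercept. If you want a coefficient and t-statistic that compare some other group against a different baseline, rerun the model with another dummy omitted.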

If you want to include all three, you can also omit the intercept. Each dummy coefficient is then that group's own level rather than a difference from A, and the tests check whether each group is different from zero.
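
A rough sketch of the no-intercept variant, again with made-up numbers and assuming the three dummies are the only regressors. In that special case each coefficient is simply the group mean and X'X is diagonal, so the standard errors can be written down by hand. With other covariates in the model you would let the software do this.

```python
import math

# Hypothetical outcome values per industry group (all three dummies included).
groups = {
    "A": [11.0, 12.0, 13.0],
    "B": [13.0, 14.0],
    "C": [9.0, 10.0, 11.0],
}

n = sum(len(v) for v in groups.values())
k = len(groups)   # one coefficient per dummy, no intercept

# Without an intercept, each dummy's OLS coefficient is its own group mean.
coef = {name: sum(v) / len(v) for name, v in groups.items()}

# Pooled residual variance s^2 = SSR / (n - k).
ssr = sum((x - coef[name]) ** 2 for name, v in groups.items() for x in v)
s2 = ssr / (n - k)

# X'X is diagonal with the group sizes, so se_g = sqrt(s^2 / n_g).
t_stat = {name: coef[name] / math.sqrt(s2 / len(v)) for name, v in groups.items()}

for name in groups:
    print(name, coef[name], round(t_stat[name], 2))
```

Note that these t-statistics test "group level = 0", which is usually a less interesting hypothesis than "group differs from A". Also be aware that R-squared is computed differently by most packages when the intercept is dropped.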

I don't know of any solution other than taking out variables when the VIFs are that high. Hopefully someone will come up with a nice solution.
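
For a sense of what a VIF of 45-50 means: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below uses the two-predictor special case, where R_j^2 is just the squared correlation between the two predictors. The data and the function names are invented for illustration.

```python
def pearson_r(x, y):
    """Sample correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

def r_squared_from_vif(vif):
    """Share of a predictor's variance explained by the other predictors."""
    return 1.0 - 1.0 / vif

# Hypothetical predictors: one pair nearly collinear, one pair almost unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
x2_unrelated = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(vif_high, vif_low)

# The VIFs reported in the question, translated back to R_j^2.
print(r_squared_from_vif(45), r_squared_from_vif(50))
```

In other words, a VIF of 45-50 says that roughly 98% of the variance of that variable is already explained by the other regressors. If the original paper reports low VIFs for the same variables, it may be worth checking whether the two variables are constructed or scaled the same way in your data.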


One thing to be careful about is that software has at least two ways to compare dummy variables. The most common way, which SAS calls reference coding, is to compare the levels of the dummy you left in to the level you left out. However, there is another way to compare dummy variables (which SAS calls effect coding). Here you compare each level of the dummy to the mean of the means of the levels. The point is to check in the documentation how your software is calculating this. SAS uses reference coding as the default; you need to know what your software is doing.
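
The difference between the two codings can be sketched as follows, with made-up data and the dummies as the only regressors. Under effect coding the rows of the left-out level get -1 in every dummy column, the intercept becomes the mean of the group means, and each coefficient is that group's deviation from it. The variable names are only illustrative.

```python
# Hypothetical outcome values per industry group (A is the left-out level).
groups = {
    "A": [11.0, 12.0, 13.0],
    "B": [13.0, 14.0],
    "C": [9.0, 10.0, 11.0],
}
means = {name: sum(v) / len(v) for name, v in groups.items()}

# Effect coding: intercept = unweighted mean of the group means,
# each coefficient = that group's mean minus the mean of the means.
grand = sum(means.values()) / len(means)
eff_b = means["B"] - grand
eff_c = means["C"] - grand
eff_a = -(eff_b + eff_c)   # implied effect of the left-out level A

def effect_row(name):
    """Design row [1, e_B, e_C]; rows of the left-out level A get -1, -1."""
    if name == "A":
        return [1.0, -1.0, -1.0]
    return [1.0, float(name == "B"), float(name == "C")]

# Check the normal equations X'(y - Xb) = 0 to confirm this is the OLS fit.
beta = [grand, eff_b, eff_c]
rows = [effect_row(name) for name, v in groups.items() for _ in v]
y = [x for v in groups.values() for x in v]
residuals = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(rows, y)]
normal_eq = [sum(row[j] * r for row, r in zip(rows, residuals)) for j in range(3)]

# Reference coding on the same data, for comparison (B versus A).
ref_b = means["B"] - means["A"]
print(eff_b, ref_b)
```

Same data, same fitted values, but the number printed next to "B" answers a different question under each coding, which is why it matters to know which one your software uses.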

Including all levels of a dummy (together with the intercept) is perfect collinearity, not just multicollinearity. If you do this, the regression cannot be estimated as specified, and typically you will get an error message or the software will drop one level for you. There are a variety of ways of dealing with multicollinearity (as opposed to perfect collinearity). The most common (if you can't gather more data and can't drop a variable) is to collapse two or more of the variables into a single factor, for example through factor analysis. Some suggest ridge regression, although others strongly object to it. None of the solutions is easy, and you should remember that multicollinearity has to be extremely high to actually have an impact, and that it only influences the t tests of individual variables, not the overall model F test.
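
Why the full set of dummies plus an intercept cannot be estimated can be seen in a small sketch. The labels and coefficient values are hypothetical. The dummy columns add up to the intercept column in every row, so different coefficient vectors produce exactly the same fitted values and the data cannot tell them apart.

```python
# Hypothetical group labels for six observations.
labels = ["A", "A", "B", "C", "C", "B"]

# Design rows [intercept, dummy_A, dummy_B, dummy_C].
rows = [[1.0] + [float(lab == g) for g in "ABC"] for lab in labels]

# Exact linear dependency: intercept - dummy_A - dummy_B - dummy_C = 0 in every row.
dependency = [row[0] - row[1] - row[2] - row[3] for row in rows]

def fitted(beta):
    """Fitted values X @ beta, written out by hand."""
    return [sum(x * b for x, b in zip(row, beta)) for row in rows]

# Two different coefficient vectors...
beta_1 = [12.0, 0.0, 1.5, -2.0]     # intercept carries the level of A
beta_2 = [0.0, 12.0, 13.5, 10.0]    # intercept is zero, dummies carry the levels

# ...that give identical predictions, so OLS has no unique solution.
same_fit = all(abs(a - b) < 1e-12 for a, b in zip(fitted(beta_1), fitted(beta_2)))
print(dependency, same_fit)
```

Dropping one dummy (or the intercept) removes the dependency and makes the solution unique again. High but imperfect multicollinearity is a different situation: the solution is unique, just imprecisely estimated.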

I suggest John Fox's "Regression Diagnostics" (published by Sage) as a primer on this and other problems with regression. He has a variety of comments on solutions and their pitfalls.