Logistic regression and correlation

#1
Hi
I'm building an account management scorecard with logistic regression. Some of the variables have quite large correlations, but they get selected into the same model (thus the effect of the correlation does not explain all the variance). According to Siddiqi (Credit risk scorecards) the effects of multicollinearity can be overcome by using a sufficiently large sample. My questions are:
1. Is this correct (i.e. can I ignore the correlations)?
2. How big can the correlation be to still be acceptable in the model?
3. How big is a sufficiently large sample?
Thanks a lot
 

#2
Some questions for you:
1. what do you mean by "they get selected into the same model"? How are they being selected?
2. what do you mean by "the effect of the correlation does not explain all the variance"? PS. You'll almost never come across a model using real world data that "explains all the variance"
3. When you really are dealing with multicollinearity, yes you can overcome it with a sufficiently large sample.

Answers:
1. Yes. However, whether you can ignore the correlations will also depend on a few other factors. I'll go into more detail once my questions are answered.
2. It depends. I wouldn't say there's a set size per se. However, anything close to 1 or -1 will raise lots of red flags.
3. Again, it depends. How big is the correlation and how big are the effect sizes?
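Before worrying about thresholds, it is worth simply measuring the pairwise correlations in your candidate pool. A minimal Python sketch; the variable names and data here are invented for illustration, not from the poster's scorecard:

```python
import numpy as np

# Made-up candidate scorecard variables; "balance" is deliberately
# constructed to correlate strongly with "utilization".
rng = np.random.default_rng(1)
n = 500
utilization = rng.random(n)
balance = 0.8 * utilization + 0.2 * rng.random(n)
payments = rng.random(n)

data = np.column_stack([utilization, balance, payments])
corr = np.corrcoef(data, rowvar=False)  # 3x3 matrix of pairwise Pearson correlations
```

Any off-diagonal entry of `corr` close to 1 or -1 flags a pair worth scrutinizing.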
 
#3
Thanks for your response. In answer to your questions:
1. I use proc logistic in SAS with the selection = stepwise option to select variables.
2. Suppose variable 1 and variable 2 are highly correlated. If var 1 is in a model and var 2 is entered into that same model, var 2's significance will probably be much lower than that of var 1, since var 1 contains a large part of the information found in var 2. However, if var 2 also shows a high significance, then that correlation does not explain all the effects contained in those two variables. Thus, value is added by including the second variable in the model. (I hope you understand what I am trying to say)

I am working with correlations of about 0.75. Do you think that is too much?
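The situation described (two predictors correlated at about 0.75, both clearly significant once the sample is large) is easy to reproduce by simulation. Everything below is an illustrative Python sketch with invented effect sizes, not the poster's data; the Newton-Raphson loop is essentially the Fisher scoring that proc logistic performs internally:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x1 = rng.standard_normal(n)
# Construct x2 so that corr(x1, x2) is about 0.75
x2 = 0.75 * x1 + np.sqrt(1 - 0.75**2) * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x2])

true_beta = np.array([-0.5, 1.0, 0.8])  # assumed effect sizes, for illustration
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# Fit logistic regression by Newton-Raphson
beta = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    H = X.T @ (X * (mu * (1 - mu))[:, None])    # Fisher information
    beta += np.linalg.solve(H, X.T @ (y - mu))

se = np.sqrt(np.diag(np.linalg.inv(H)))  # Wald standard errors
z = beta / se  # with n this large, both slope z-statistics come out far above 2
```

Rerun the same setup with a small n and one of the two slopes will often come out insignificant, which is the sample-size point Siddiqi is making.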
 
#4
Hi,
I think that's a big correlation if you're talking about partial correlation.

Can you quote your SAS proc for this? It's quite rare (if there are no lurking variables!) for variables with such a high partial correlation to both be significant. Please quote the p-values too.
 

#5
Stepwise regression is very data-adaptive. The resulting model is only as good as the data you have, regardless of your hypothesis and theory. Though this method sort of indirectly addresses multicollinearity, it does not take care of it.

I agree with your sentence: ...if var 2 also shows a high significance [in addition to var 1], then that correlation does not explain all the effects contained in those two variables. Thus, value is added by including the second variable into the model.

Working with a correlation of about 0.75 is complicated. In my view, it's high enough that you will see problems with the regression, but also low enough that you really have to consider whether you want to drop one of the variables.

To learn more about this and possible remedies, look up multicollinearity on Wikipedia.
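One concrete diagnostic from that material is the variance inflation factor (VIF). With only two correlated predictors it reduces to a one-liner, so you can see directly what r = 0.75 does to the coefficient variances; the 5 and 10 cutoffs below are common rules of thumb, not hard limits:

```python
def vif_from_r(r):
    """VIF implied by a pairwise correlation r between two predictors."""
    return 1.0 / (1.0 - r ** 2)

# r = 0.75 inflates the coefficient variance by about 2.3x -- noticeable,
# but below the rule-of-thumb cutoffs of 5 or 10.
vif_moderate = vif_from_r(0.75)
vif_severe = vif_from_r(0.95)  # already past 10
```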
 
#6
Proc logistic gives you AIC's and other information criteria by default. Can you try different models and see if AIC tells you to chose the same model that the stepwise function picks? (BTW From what I've read, AICs are more helpful when you can throw out as many unrealistic models as possible before comparing the AICs of the remaining ones). I'm not sure how the stepwise feature in proc logistic works so this might be the same thing, but you could compare the -2 log likelihoods of the models with and without the highly correlated 2nd variable and do a 1df chi squared test to see if adding variable2 explains significantly more variation in your response variable.