Logistic Regression Complete Separation

hlsmith

Omega Contributor
#1
Hey y'all,

I was running a stacked ensemble (a weighted combination of base machine learners) yesterday (R: H2O: autoML), and I noticed the top contributing model had an accuracy of 99.995%. It was a gradient boosted model (per random grid search) for a classification problem. I thought, hey, maybe the model really is that good, since I am not overly familiar with GBM. So today I wanted to check it out and ran a logistic regression on the problem, and got a complete separation error. I am on a laptop and using R, and I am not great with either the little computer or R. So I toyed around with the model by excluding a single covariate at a time in the logistic regression. I noticed that when either X1 or X5 was excluded, the model would run without error. So I then generated the three different 3-way contingency tables (since the variables were binary: Y, X1, X5) and did not notice any empty cells in the tables.

What do you all think for investigating this?

P.S. Side note: the model is actually modeling the missingness of a variable in the dataset, i.e., P(missing (y/n) | X).
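For anyone wanting to reproduce the symptom described above, here is a minimal sketch with simulated data (not the poster's dataset): two binary predictors that jointly determine Y, which triggers base R's separation warning in glm().

```r
# Simulated illustration (not the actual data): Y is perfectly
# determined by X1 and X5 together, so the logistic MLE does not exist.
set.seed(1)
n  <- 200
X1 <- rbinom(n, 1, 0.5)
X5 <- rbinom(n, 1, 0.5)
Y  <- as.integer(X1 == 1 & X5 == 0)  # complete separation by (X1, X5)

# glm() warns: "fitted probabilities numerically 0 or 1 occurred"
fit <- glm(Y ~ X1 + X5, family = binomial)
coef(fit)  # X1 and X5 coefficients are huge, diverging toward +/- Inf
```

Note that glm() still returns coefficients here; only the warning signals that the estimates are running off to infinity.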
 

hlsmith

Omega Contributor
#2
I was just thinking: if the covariates were continuous, I would be able to see the separation, like in support vector machines. So shouldn't I be able to see the issue in the data via the contingency tables? And wouldn't there just be three unique contingency tables, stratifying by one of the three other variables?
 

spunky

Doesn't actually exist
#3
I noticed that when either X1 or X5 was excluded, the model would run without error. So I then generated the three different 3-way contingency tables (since the variables were binary: Y, X1, X5) and did not notice any empty cells in the tables.
Ok, I'm gonna preface this by saying I'm not really an expert on this, but I always assumed that when you found separation in a logistic regression setting, it could basically arise from any n-way table that includes the suspect variables. Like maybe X1 and X5 play a role in this, but only in the presence of, say, X2. So a 3-way table that doesn't include X2 will not show the problem. If your number of regressors is not large, you can maybe try an all-possible-subsets approach, ensuring that the suspect covariates are present, and inspect those contingency tables.
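A rough sketch of this all-subsets table check (simulated stand-in data; with ten predictors there are only 2^8 = 256 subsets containing X1 and X5, so brute force is fine):

```r
# Simulated stand-in for the data: binary Y plus X1..X10.
set.seed(1)
dat <- as.data.frame(matrix(rbinom(500 * 11, 1, 0.5), ncol = 11))
names(dat) <- c("Y", paste0("X", 1:10))

# For every subset of the other covariates, cross-tabulate it together
# with Y, X1, X5, and flag subsets whose joint table has an empty cell.
others   <- setdiff(names(dat), c("Y", "X1", "X5"))
suspects <- character(0)
for (k in seq_along(others)) {
  for (vars in combn(others, k, simplify = FALSE)) {
    tab <- table(dat[, c("Y", "X1", "X5", vars)])
    if (any(tab == 0)) {
      suspects <- c(suspects, paste(vars, collapse = ":"))
    }
  }
}
head(suspects)  # subsets whose joint table contains a zero cell
```

One caveat: empty cells are very common in high-dimensional tables simply because the cell count outgrows the sample size, so a zero cell is necessary-but-not-sufficient evidence of separation; a dedicated check (as discussed further down the thread) is more reliable.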
 

hlsmith

Omega Contributor
#4
@spunky thanks. Good point. I will also note that the model y ~ X1 + X5 will run, which supports your comment. Tibshirani may have a subsets package, though it could end up hurting the laptop :( However, I think it uses some type of optimization to address outrageously large numbers of combinations. I have ten predictors, I believe. To the internet...
 

hlsmith

Omega Contributor
#5
Well, R has a package for detecting separation: brglm2. I got the following output, which I am gonna have to think about, since it says the same thing I was seeing with these data.

_____________________________________________________________
Separation: TRUE
Existence of maximum likelihood estimates
(Intercept) X1 X2 X3
-Inf -Inf 0 0
X4 X6 X7 X8
0 0 0 0
X5 X9 X10
Inf 0 0
0: finite value, Inf: infinity, -Inf: -infinity
______________________________________________________________


So am I not taking the intercept into account when trying to visualize this, i.e., all of the reference groups plus X1 and X5? Or does the intercept go to -Inf just because of the other -Inf?
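For reference, the output above looks like what brglm2's separation check prints; a hedged sketch of the call (in newer R setups this functionality moved to the detectseparation package, so the package and method names may differ by version — check your installed documentation):

```r
# Hedged sketch: detecting separation with brglm2's "detect_separation"
# fitting method (small simulated dataset, not the poster's data).
library(brglm2)
set.seed(1)
dat <- data.frame(Y  = rbinom(100, 1, 0.5),
                  X1 = rbinom(100, 1, 0.5))
dat$Y[dat$X1 == 1] <- 1  # force quasi-complete separation on X1

sep <- glm(Y ~ X1, data = dat, family = binomial,
           method = "detect_separation")
sep  # prints "Separation: TRUE" and which estimates are infinite
```

The check works by solving a linear program rather than fitting the model, so it reports which coefficients would diverge (the Inf / -Inf pattern in the output above) without iterating to a bogus fit.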
 

hlsmith

Omega Contributor
#6
Could this actually be the issue? Holding all other covariates at their reference value, you get 1 subject with Y=0, X1=0, X5=0. Hmm, confusing myself now.

Edited: Ran this in SAS and it calls it "possibly a quasi-complete separation of data points."
 

spunky

Doesn't actually exist
#7
Or does the intercept go to -Inf just because of the other -Inf?
I think that is the case. Remember that the intercept in regression models is a function of the means of the covariates and the regression coefficients. If one of those is infinite, then any arithmetic expression involving it would be infinite too.
 
ondansetron

#9
I'm pretty sure separation refers to the sparsity of outcomes within levels of an IV. For example, all males had heart attacks (1) and all females did not (0), so M/F tells you MI vs. not.
Quasi-separation is where almost all of the cases per level have the same outcome (98% of males had an MI, 97% of females did not).

Basically, sparsity of outcomes in a level of a categorical independent variable.
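The M/F example above as a toy table (a hedged illustration; "quasi" here in the loose sense of the post, meaning near-empty rather than strictly empty cells):

```r
# Complete separation: sex perfectly predicts MI.
sex <- rep(c("M", "F"), each = 50)
mi_complete <- as.integer(sex == "M")   # all males 1, all females 0
table(sex, mi_complete)                 # each row has a zero cell

# "Quasi" in the loose sense above: flip one case per group so the
# prediction is only almost perfect.
mi_quasi <- mi_complete
mi_quasi[c(1, 51)] <- 1 - mi_quasi[c(1, 51)]
table(sex, mi_quasi)                    # near-zero, not zero, cells
```

In the first table a logistic fit of MI on sex diverges; in the second it converges but the sex coefficient is still extreme and unstable.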
 

hlsmith

Omega Contributor
#10
Yes @spunky and @ondansetron - I am familiar with the definition. I think my issue was that the R error is likely the same for "complete" and "quasi" separation, so I was looking for complete separation in the tables. But then SAS better articulated that it was "quasi". When I looked at the potential sparsity, there is one cell with a count of 6 in the smaller of the outcome groups for the X1=1 and X5=0 group. I guess that is it. Thanks. I know how to address it if need be; I was just expecting a larger, glaring problem, since the gbm model was near perfect - I have a couple of other things in the set to investigate for that. The following is a good paper that came out earlier this year on the topic and discusses options for how to address it:

https://academic.oup.com/aje/article-abstract/187/4/864/4084405
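One common remedy discussed in that literature is Firth-type penalized likelihood, which brglm2 (already used earlier in the thread) exposes as an alternative fitting method; a hedged sketch with simulated separated data (argument names per the brglm2 docs — verify against your installed version):

```r
# Firth-type bias reduction keeps estimates finite under separation.
library(brglm2)
set.seed(1)
x <- rbinom(100, 1, 0.5)
y <- x  # complete separation: y identical to x

fit_ml <- glm(y ~ x, family = binomial)  # warns; ML estimate diverges
fit_br <- glm(y ~ x, family = binomial,
              method = "brglmFit")       # bias-reduced (penalized) fit
coef(fit_br)                             # finite estimates
```

The logistf package is another frequently used implementation of Firth's method, if staying within glm() is not a requirement.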

Thanks.
 

hlsmith

Omega Contributor
#11
Follow-up: a major component of my prior issue was that I was using the wrong dataset, one which had a very imbalanced outcome. When I switched sets, the gbm had an AUC around 0.63, which is more reassuring than 0.9995. User error as usual -> I blame the small laptop and R's minimal default output in glm :)

Thanks for the feedback @spunky and @ondansetron