Logistic Reg Complete Separation

hlsmith

Not a robit
#1
Hey y'all,

I was running a stacked ensemble (weighted combination of base machine learners) model yesterday (R: H2O: autoML), and I noticed the top contributing model had an accuracy of 99.995%. It was a gradient boosted model (per random grid search) for a classification problem. I thought hey, maybe the model really is that good, since I am not overly familiar with GBM. So today I wanted to check it out, ran a logistic regression on the problem, and got a complete separation error. I am on a laptop and using R, and I am not great with either the little computer or R. So I toyed around with the model by excluding a single covariate at a time in the logistic regression. I noticed that when either X1 or X5 was excluded, the model would run without error. So I then generated the three different 3-way contingency tables (since the variables were binary: Y, X1, X5) and did not notice any null cells in the tables.

What do you all think for investigating this?

P.S. Side note: the model is actually modeling the missingness of a variable in the dataset, so Pr(missing (y/n) | X).
 
Last edited:

hlsmith

Not a robit
#2
I was just thinking: if the covariates were continuous, I would be able to see the separation, like in support vector machines. So shouldn't I be able to see the issue in the data via the contingency tables, and wouldn't there just be 3 unique contingency tables per stratification by 1 of the 3 other variables?
 

spunky

Doesn't actually exist
#3
I noticed that when either X1 or X5 was excluded, the model would run without error. So I then generated the three different 3-way contingency tables (since the variables were binary: Y, X1, X5) and did not notice any null cells in the tables.
Ok, I'm gonna preface this by saying I'm not really an expert on this, but I always assumed that when you find separation in a logistic regression setting, it can basically come from any n-way table that includes the suspect variables. Maybe X1 and X5 play a role in this, but only in the presence of, say, X2. So a 3-way table that doesn't include X2 will not show the problem. If your number of regressors is not large, you can maybe try an all-possible-subsets search, ensuring that the suspect covariates are present, and inspect those contingency tables.
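If the regressor count is small, the all-possible-subsets idea above can be brute-forced directly. Here is a minimal sketch in Python (rather than the thread's R), using hypothetical binary variables and toy data constructed so that the empty cell only appears once X2 joins the table, exactly the situation described:

```python
from itertools import combinations, product
from collections import Counter

def empty_cell_subsets(rows, y, suspects, others):
    """Scan every covariate subset that contains the suspect variables and
    report those whose joint contingency table with the outcome has at
    least one empty cell (a symptom of separation for binary data)."""
    flagged = []
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            cols = list(suspects) + list(extra)
            seen = Counter((row[y],) + tuple(row[c] for c in cols)
                           for row in rows)
            if len(seen) < 2 ** (len(cols) + 1):  # all variables binary
                flagged.append(tuple(cols))
    return flagged

# Hypothetical toy data: every (Y, X1, X2, X5) combination occurs once,
# EXCEPT (Y=0, X1=1, X2=1, X5=1). The 3-way (Y, X1, X5) table is then
# full, but the 4-way table that adds X2 has an empty cell.
rows = [{"Y": y, "X1": x1, "X2": x2, "X5": x5}
        for y, x1, x2, x5 in product([0, 1], repeat=4)
        if (y, x1, x2, x5) != (0, 1, 1, 1)]

print(empty_cell_subsets(rows, "Y", ["X1", "X5"], ["X2"]))
# -> [('X1', 'X5', 'X2')]
```

With 10 predictors and 2 suspects this is only 2^8 = 256 tables, so a laptop can handle it, though an empty joint cell becomes more likely (and less informative) as tables get higher-dimensional.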
 

hlsmith

Not a robit
#4
@spunky thanks. Good point. I will also note that the model y ~ X1 + X5 will run, which supports your comment. Tibshirani may have a subset package, though it could end up hurting the laptop :( However, I think it uses some type of optimization to address the outrageously large number of combinations. I have ten predictors, I believe. To the internet...
 

hlsmith

Not a robit
#5
Well, R has a package for detecting separation, brglm2. I got the following output, which I am going to have to think about, since it says the same thing I was saying about these data.

_____________________________________________________________
Separation: TRUE
Existence of maximum likelihood estimates
(Intercept)          X1          X2          X3
       -Inf        -Inf           0           0
         X4          X6          X7          X8
          0           0           0           0
         X5          X9         X10
        Inf           0           0
0: finite value, Inf: infinity, -Inf: -infinity
_____________________________________________________________


So am I not taking the intercept into account when trying to visualize this, i.e., all of the reference groups plus X1 and X5? Or does the intercept go to -Inf just because of the other (-)Inf?
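On what an infinite estimate means mechanically: under separation, the log-likelihood keeps increasing as the offending coefficient grows, so no finite maximizer exists and the MLE is reported as +/-Inf. A minimal pure-Python sketch (toy one-predictor, no-intercept model, not the thread's data) shows the likelihood climbing toward 0 without ever reaching a maximum:

```python
import math

def loglik(beta, xs, ys):
    """Bernoulli log-likelihood for a one-predictor logistic model
    (no intercept): Pr(y=1 | x) = 1 / (1 + exp(-beta * x))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# Perfectly separated toy data: y = 1 exactly when x > 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for beta in (1.0, 5.0, 25.0):
    print(beta, loglik(beta, xs, ys))
# The log-likelihood increases monotonically toward 0 as beta grows,
# so the maximizer does not exist at any finite beta: beta_hat -> +Inf.
```

This is why glm's Fisher scoring either fails to converge or reports huge coefficients with huge standard errors when separation is present.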
 

hlsmith

Not a robit
#6
Could this actually be the issue? Holding all other covariates at their reference values, you get 1 subject with Y=0, X1=0, X5=0. Hmm, confusing myself now.

Edited: Ran this in SAS and it calls it "possibly a quasi-complete separation of data points."
 
Last edited:

spunky

Doesn't actually exist
#7
Or does the intercept go to -Inf just because of the other (-)Inf?
I think that is the case. Remember that the intercept in regression models is a function of the means of the covariates and the regression coefficients. If one of the coefficients is Inf, then any arithmetic function of it would result in an Inf too.
 

ondansetron

TS Contributor
#9
I'm pretty sure separation refers to the sparsity of outcomes within levels of an IV. For example, complete separation: all males had heart attacks (1) and all females did not (0), so M/F alone tells you MI vs. not.
Quasi-complete separation is where some, but not all, levels have a single outcome (e.g., all males had an MI, while the females were a mix of MI and no MI).

Basically, sparsity of outcomes in a level of a categorical independent variable.
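For a single binary predictor, those two definitions reduce to a small counting check. A sketch in Python (for illustration only; a real multi-predictor diagnosis needs something like brglm2, since separation can also arise from linear combinations of covariates):

```python
from collections import Counter

def separation_type(x, y):
    """Classify separation of a binary outcome y by a binary predictor x:
    'complete' if every level of x has only one outcome,
    'quasi'    if at least one (but not every) level is pure,
    'none'     if both levels have a mix of outcomes."""
    counts = Counter(zip(x, y))
    pure = [lvl for lvl in (0, 1)
            if counts[(lvl, 0)] == 0 or counts[(lvl, 1)] == 0]
    if len(pure) == 2:
        return "complete"
    if len(pure) == 1:
        return "quasi"
    return "none"

# All males (1) had MI, all females (0) did not -> complete separation.
print(separation_type([1, 1, 0, 0], [1, 1, 0, 0]))        # complete
# Males still pure, but one female also had an MI -> quasi-separation.
print(separation_type([1, 1, 0, 0, 0], [1, 1, 1, 0, 0]))  # quasi
# Mixed outcomes in both levels -> no separation.
print(separation_type([1, 1, 0, 0], [1, 0, 1, 0]))        # none
```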
 
#10
Yes @spunky and @ondansetron - I am familiar with the definition. I think my issue was that the R error is likely the same for "complete" and "quasi" separation, so I was looking for complete separation in the tables. But then SAS better articulated that it was "quasi". When I looked at the potential sparsity, there is one cell with a value of 6 in the smaller of the outcome groups for the X1=1 and X5=0 group. I guess that is it. Thanks. I know how to address it if need be; I was just expecting a larger, glaring problem, since the gbm model was near perfect - I have a couple of other things in the set to investigate for that. The following is a good paper that came out earlier this year on the topic, and it lays out options for how to address it:

https://academic.oup.com/aje/article-abstract/187/4/864/4084405

Thanks.
 
#11
Follow-up: a major component of my prior issue was that I was using the wrong dataset, one which had a very imbalanced outcome. When I switched sets, the gbm had an AUC around 0.63, which is more reassuring than 0.9995. User error as usual -> I blame the small laptop and R's minimal default output in glm :)

Thanks for the feedback, @spunky and @ondansetron.