Help with GLM/logit diagnostics and result interpretation

#1
I have this complex toy example through which I'm trying to learn how to apply generalized linear models (a logit model, to be exact, where the response has two levels) to a research problem I'm battling. However, since I have very limited experience with these types of models, I need help assessing the validity and goodness of fit of the model.

I've been reading about the subject for days, so I know a bit, but how, for example, should the plots and the ANOVA table be interpreted? Please help me out...

Below are some results/printouts from my model. Keep in mind that the problem is hard, i.e., the success rate is low: around 60% with an optimal cut-off.

SUMMARY:
Code:
Call:
glm(formula = truth ~ factor_1 + factor_3 + factor_4 + factor_6 + factor_7 + factor_10 + 
    factor_11 + factor_12 + factor_13 + factor_14 + factor_15 + factor_17 + factor_18, family = binomial("logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.10579  -1.14359   0.07018   1.15778   4.07823  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.006e+00  4.910e-02  20.484  < 2e-16 ***
factor_1        -1.518e-02  5.708e-04 -26.591  < 2e-16 ***
factor_3        -1.230e-01  5.697e-03 -21.588  < 2e-16 ***
factor_4         1.498e-02  6.399e-03   2.342 0.019188 *  
factor_6         4.784e-01  2.831e-02  16.896  < 2e-16 ***
factor_7        -1.736e-01  1.550e-02 -11.202  < 2e-16 ***
factor_10       -5.781e-07  2.958e-08 -19.546  < 2e-16 ***
factor_11       -1.110e-03  2.042e-04  -5.437 5.43e-08 ***
factor_12       -1.137e-01  3.306e-02  -3.439 0.000584 ***
factor_13       -8.764e-02  3.650e-02  -2.401 0.016338 *  
factor_14       -3.583e-01  3.455e-02 -10.371  < 2e-16 ***
factor_15       -3.363e-01  4.176e-02  -8.052 8.16e-16 ***
factor_17        1.800e-06  6.156e-08  29.232  < 2e-16 ***
factor_18       -6.518e-03  3.175e-03  -2.053 0.040089 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 207944  on 149999  degrees of freedom
Residual deviance: 203171  on 149986  degrees of freedom
AIC: 203199
ANOVA:
Code:
Analysis of Deviance Table

Model: binomial, link: logit

Response: truth

Terms added sequentially (first to last)


          Df Deviance Resid. Df Resid. Dev P(>|Chi|)    
NULL                     149999     207944              
factor_1   1  1393.15    149998     206551 < 2.2e-16 ***
factor_3   1   934.43    149997     205617 < 2.2e-16 ***
factor_4   1    77.46    149996     205539 < 2.2e-16 ***
factor_6   1   289.88    149995     205249 < 2.2e-16 ***
factor_7   1   365.69    149994     204884 < 2.2e-16 ***
factor_10  1   349.35    149993     204534 < 2.2e-16 ***
factor_11  1    36.79    149992     204497 1.315e-09 ***
factor_12  1   120.22    149991     204377 < 2.2e-16 ***
factor_13  1   162.69    149990     204214 < 2.2e-16 ***
factor_14  1    48.78    149989     204166 2.862e-12 ***
factor_15  1    71.28    149988     204094 < 2.2e-16 ***
factor_17  1   919.35    149987     203175 < 2.2e-16 ***
factor_18  1     4.22    149986     203171   0.04005 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
WALD TEST:
Code:
Wald test

Model 1: truth ~ factor_1 + factor_3 + factor_4 + factor_6 + factor_7 + factor_10 + factor_11 + factor_12 + 
    factor_13 + factor_14 + factor_15 + factor_17 + factor_18
Model 2: truth ~ 1
  Res.Df  Df Chisq Pr(>Chisq)    
1 149986                         
2 149999 -13  4483  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Plots:
 

terzi

TS Contributor
#4
Hi Isac3,

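The first part of the output is just the coefficient estimates, their standard errors, and a z test of whether each coefficient is zero. The tests are read the same way as in linear regression, but keep in mind that each coefficient is on the log-odds (logit) scale, not on the scale of the response.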
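To make the z test concrete, here is a small sketch in plain Python (just arithmetic, not R) that recomputes the factor_4 row from its printed estimate and standard error. The helper name `wald_z_test` is made up for illustration, and since the inputs are the rounded values from the printout, the result should only agree with the table to about three digits.

```python
import math

def wald_z_test(estimate, std_error):
    """Return (z, two-sided p-value) for H0: coefficient == 0."""
    z = estimate / std_error
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# factor_4 row from the summary (rounded printed values)
z_f4, p_f4 = wald_z_test(1.498e-02, 6.399e-03)
print(f"z = {z_f4:.3f}, p = {p_f4:.4f}")
```

With 150,000 observations the standard errors are tiny, so even a weak effect like factor_4 can come out "significant"; the p-value says little about how much the factor actually helps prediction.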

Still, generalized linear models are very different from "common" linear models. First, remember that you are dealing with maximum likelihood estimates, which complicates things a little. For instance, there is no classical ANOVA table for this type of model. What you have instead is an analysis of deviance table; I hadn't seen it under that name before, but it works like a sequential ("stepwise") procedure. Deviance is a measure of lack of fit (lower is better), and what you see in that table is how much it drops when each term is added to the terms already in the model, so the rows depend on the order of the terms. The line labelled NULL is the intercept-only model.
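As a rough check on how the table fits together, the sketch below (plain Python, numbers copied from the printouts) adds up the Deviance column, compares it with the gap between the null and residual deviance, and recomputes the last row's p-value. The variable names are my own, and the AIC line assumes ungrouped 0/1 data, where the residual deviance equals minus twice the log-likelihood.

```python
import math

null_dev, resid_dev = 207944.0, 203171.0   # from the summary printout
n_coef = 14                                 # 13 factors + intercept

# Deviance column of the analysis of deviance table, first to last
drops = [1393.15, 934.43, 77.46, 289.88, 365.69, 349.35, 36.79,
         120.22, 162.69, 48.78, 71.28, 919.35, 4.22]

total_drop = sum(drops)            # should be close to null_dev - resid_dev
lr_chisq = null_dev - resid_dev    # likelihood ratio statistic on 13 df
aic = resid_dev + 2 * n_coef       # assumes deviance == -2 * log-likelihood

def chisq1_pvalue(x):
    """Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

p_factor_18 = chisq1_pvalue(drops[-1])   # last row of the table

print(total_drop, lr_chisq, aic, p_factor_18)
```

The total drop (about 4773) is the likelihood ratio statistic for the full model against the intercept-only model. It is not the same number as the Wald chi-square of 4483: both test the same hypothesis, but they are different statistics. Note also that the drop is small relative to the null deviance of 207944, which is consistent with the low success rate you mention.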

I would suggest that you start by reading about logistic regression as a topic in its own right, not as a special case of generalized linear models. That way you will understand terms such as deviance, information criteria, likelihood ratio tests, etc., and also some cool things that help with interpretation, like odds ratios. After that, move on to the general model. Also, start with only a few factors in your model :).
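As a taste of the odds ratios mentioned above, here is a minimal Python sketch that exponentiates two of the printed coefficients and attaches an approximate 95% Wald interval. The function name `odds_ratio_ci` and the choice of factor_6 and factor_14 are only for illustration, and the 1.96 multiplier assumes the usual normal approximation.

```python
import math

# (estimate, std. error) copied from the coefficient table
coefs = {
    "factor_6":  (4.784e-01, 2.831e-02),
    "factor_14": (-3.583e-01, 3.455e-02),
}

def odds_ratio_ci(estimate, std_error, z_crit=1.96):
    """Odds ratio exp(b) with an approximate 95% Wald interval."""
    return (math.exp(estimate),
            math.exp(estimate - z_crit * std_error),
            math.exp(estimate + z_crit * std_error))

odds_ratios = {name: odds_ratio_ci(b, se) for name, (b, se) in coefs.items()}
for name, (ratio, low, high) in odds_ratios.items():
    print(f"{name}: OR = {ratio:.3f}  (95% CI {low:.3f} to {high:.3f})")
```

An odds ratio above 1 means the odds of the modelled response level go up with a one-unit increase in that factor, holding the others fixed; below 1 means they go down. Which level of truth counts as the "success" depends on how the response is coded, so check that before reading the direction.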

Of course, for any further help, feel free to ask.