Logistic Regression: Skewed Data with Lots of Zeros!

#1
Hello,

I am working on a logistic regression and the results I am getting are not satisfactory. The Hosmer and Lemeshow test is rejected and, more importantly, the classification table is not classifying my predicted values: for example, I get 100% sensitivity at probability level 0.00. I am working with very skewed financial data with lots of zeros. I have already tried standardizing and log-transforming the data, but I get similar results. Does anybody have any idea what else I should do, given skewed data with many zeros and a standard deviation higher than the mean?

Thanks in advance,

Marcio
 

noetsi

Fortran must die
#2
As far as I know logistic regression does not assume multivariate normality, so I don't understand why skewing would matter.

I am not sure what you mean by "lots of zeros." Large numbers should not create problems for the predictor variables. A standard deviation higher than the mean can be caused by lots of things. For example, partial separation can cause it (SPSS will not warn you that it exists; what software are you using?). Multicollinearity will too, I believe. Have you run a VIF or tolerance test on your predictors? A small sample size could also be an issue.
 

Dason

Ambassador to the humans
#3
I'm just not sure what the OP is talking about. I mean I'm sure I understand the statistics but when they say "data" are they referring to the response? The predictors? It's not clear to me at all.

Marcio - can you please describe what you mean in more detail?
 
#4
Sure, I can explain better. My situation is the following. I have a data set with 350,000 cases, of which about 1% have the event I am observing. It contains a set of financial predictor variables, most of which are about 50% to 75% zeros. I imputed a few variables, such as revenue, with the mean, and I have not treated outliers yet. I checked VIF and excluded the correlated variables from the model inputs. As I mentioned, the results that should show the model classifies well are not satisfactory: the Hosmer and Lemeshow test is rejected, and the classification table is not classifying, since sensitivity is 100% where the probability level equals 0. My question is mostly about the classification table: what could be the reason it does not classify the event at all?

Thanks in advance
 

noetsi

Fortran must die
#5
You might want to look at Paul Allison's "Logistic Regression Using SAS" 2nd ed p66-68. He is critical of this (popular) test, as are others. Among other things he notes:

1) The test is not very powerful (this means models that are wrong will appear right, since rejection of the null in this test is a sign of bad fit). From what you say this is not what you are having trouble with, however.
2) The number of groups used can dramatically change the p value. So you might find a p value that leads to rejection with nine groups, but not at all with ten. The number of groups used by software is essentially artificial.
3) It does not deal well with interaction terms (that is, highly significant interaction terms can make the HL statistic much worse, which should not occur).
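
To make point 2 concrete, here is a minimal sketch of how the HL statistic is built from groups of cases sorted by predicted probability, so that changing the number of groups changes the statistic. The function name `hosmer_lemeshow` and the simulated data are made up for illustration; this is not the output of SAS or any other package, and the p value (usually taken from a chi-square distribution with groups minus 2 degrees of freedom) is left out to keep the sketch to the standard library.

```python
import random


def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees_of_freedom) for a Hosmer-Lemeshow style test.

    y: list of 0/1 outcomes; p: list of predicted probabilities in (0, 1).
    Cases are sorted by predicted probability and cut into `groups`
    bins of nearly equal size.
    """
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(prob for prob, _ in chunk)   # expected events in bin
        observed = sum(out for _, out in chunk)     # observed events in bin
        # Pearson-type contribution from events and from non-events
        stat += (observed - expected) ** 2 / expected
        stat += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return stat, groups - 2


# Simulated data (an assumption for illustration): outcomes are drawn
# from the predicted probabilities, so the "model" is correct here.
random.seed(0)
p = [random.uniform(0.01, 0.2) for _ in range(2000)]
y = [1 if random.random() < pi else 0 for pi in p]

for g in (9, 10):
    stat, df = hosmer_lemeshow(y, p, groups=g)
    print(f"groups={g} statistic={stat:.3f} df={df}")
```

Running it with 9 and then 10 groups on the same predictions shows how the grouping alone moves the statistic.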
 

noetsi

Fortran must die
#7
It's also possible that your model just does not fit the data well.

Another alternative is to do chi square or deviance tests (these test goodness of fit as well).
 
#8
I was about to cite the same passage from Allison's "Logistic Regression Using SAS" about the problems with the Hosmer and Lemeshow test. Also, I've heard that imputing the mean for missing data is not good practice and that imputing values based on the data you do have can be better.

I'm not sure this will help (I'm learning about logistic regression myself) but maybe try changing your sample from 1% with the event and 99% non-event to a more even split. So you'd keep the 3,500 cases that have the event, then pick a random 3,500 cases from the remaining 346,500. According to "Logistic Regression Using SAS", the intercept will be off but the slope coefficients will be unbiased estimates of the slopes in the full population.
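
A rough sketch of that subsampling, using the counts from this thread. The helper names `balanced_subsample` and `intercept_offset` are hypothetical. The intercept adjustment is not spelled out above: it is the usual correction for sampling on the outcome, where the subsample intercept is too high by the log of the ratio of the two sampling fractions.

```python
import math
import random


def balanced_subsample(y, seed=0):
    """Keep every event and an equally sized random draw of non-events.

    Returns the sorted row indices to keep.
    """
    events = [i for i, v in enumerate(y) if v == 1]
    non_events = [i for i, v in enumerate(y) if v == 0]
    rng = random.Random(seed)
    kept = rng.sample(non_events, min(len(events), len(non_events)))
    return sorted(events + kept)


def intercept_offset(event_fraction, non_event_fraction):
    """Amount by which the subsample intercept exceeds the full-data one.

    The arguments are the fractions of events and of non-events that
    were kept in the subsample.
    """
    return math.log(event_fraction / non_event_fraction)


# Counts from the thread: 3,500 events and 346,500 non-events.
y = [1] * 3500 + [0] * 346500
keep = balanced_subsample(y)

# All events kept (fraction 1.0); 3,500 of 346,500 non-events kept.
offset = intercept_offset(1.0, 3500 / 346500)
print(len(keep), round(offset, 3))
# corrected_intercept = subsample_intercept - offset
```

Fit the model on the rows in `keep`, then subtract `offset` from the fitted intercept before turning scores into probabilities for the full population; the slopes are used as they are.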

Also, I didn't fully understand the sensitivity question. Sensitivity of <100% means that the model did not predict 100% of the events correctly for a given cutpoint. Say your cutpoint was 50% probability. If every case where the event actually happens gets a predicted probability of 50% or greater, then you'd have 100% sensitivity. If your cutpoint is very low, say zero, then you'll get 100% sensitivity even if your model is terrible (in which case your specificity would be at or near zero, since you didn't predict non-events well). I think you wrote that your cutpoint was zero and you didn't get 100% sensitivity. If that's what happened, I don't know how that's possible. Maybe the problem is clear enough now for someone else to answer ...?
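
Here is a small sketch of how one row of a classification table is computed at a given cutpoint. The outcomes and predicted probabilities are made up for illustration, and `sensitivity_specificity` is a hypothetical helper, not a function from any package.

```python
def sensitivity_specificity(y, p, cutpoint):
    """Classify a case as an event when its predicted probability >= cutpoint."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= cutpoint)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < cutpoint)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < cutpoint)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= cutpoint)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up data: 2 events among 10 cases, with the low predicted
# probabilities you would expect when the event is rare.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.30, 0.05, 0.20, 0.02, 0.01, 0.03, 0.04, 0.01, 0.02, 0.06]

for cut in (0.0, 0.1, 0.5):
    sens, spec = sensitivity_specificity(y, p, cut)
    print(f"cutpoint={cut:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

At a cutpoint of 0 every case is classified as an event, so sensitivity is 100% and specificity is 0 regardless of the model. With a rare event the predicted probabilities are mostly small, so a cutpoint of 0.5 can classify nothing as an event, which is one way a table ends up looking like it "does not classify".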