Logistic Regression (very small proportion of 'yes')

Hi Everyone,
I am working on a logistic regression problem (injury [y/n] predicted by air temperature and physical training program). The problem is that only 1.2% of the entire sample of 10,000 had an injury, and this 1.2% is almost equally distributed between the 2 PT programs (both had about 1.2% of individuals with an injury). Looking at the raw data, temperature seems to be much more influential on injury than PT program. When I run the logistic regression (2 predictors, no interaction), it says that PT program is actually more influential than temperature (PT OR = 1.9, temp OR = 1.4). Then if I add an interaction, the results are odd: the PT OR = 29811, and the interaction is slightly significant, although it is a difference in magnitude rather than in direction. I am just lost; I don't know if I'm running in circles, and I don't know if it is because the proportion of 'yes' is just too small. ANY SUGGESTIONS???


Less is more. Stay pure. Stay poor.
The number of events is still over a hundred persons, which is reasonable.

Were you saying that temp is your only significant variable, but when you add back program and program*temp interaction term, the interaction term is significant?

If yes, I would plot the probability of the event by temperature, stratified by program, so you will have more than one line and can see whether the interaction makes sense from a physiological standpoint.
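
A minimal sketch of the numbers behind such a plot, in Python rather than SPSS. The coefficient values and the 0/1 program coding are made up for illustration and are not taken from the poster's model; substitute the estimates from your own output.

```python
import math

def predicted_probability(temp, program, b0, b_temp, b_prog, b_int):
    """Predicted injury probability from a logistic model with a
    program-by-temperature interaction (program coded 0/1)."""
    logit = b0 + b_temp * temp + b_prog * program + b_int * temp * program
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients, for illustration only.
coefs = dict(b0=-9.0, b_temp=0.06, b_prog=1.5, b_int=-0.02)

# One curve per program over a grid of temperatures.
temps = list(range(60, 101, 10))
curves = {
    program: [predicted_probability(t, program, **coefs) for t in temps]
    for program in (0, 1)
}

for program, probs in curves.items():
    print("program", program, [round(p, 4) for p in probs])
```

Plotting each list in `curves` against `temps` gives one line per program. Lines that both rise but at different rates are an interaction in magnitude only.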

How many programs are there?
Thanks for the quick response. I have done that. I added the interaction term and saved the predicted probabilities. The interaction term does somewhat make sense, but it's really only different in magnitude, not in direction. I just don't understand why I would get an exp(beta) as large as 29811 for the main effect term. There are only 2 programs.
Attached is an image of my raw data to show how it really is temperature that is affecting proportion injured. Although there is an interaction, directionally series 1 and 2 (which are the 2 programs) have the same increasing effect through temp. Thanks again, really appreciate it.


Try to rerun your graph using the estimated probabilities from the model (y-axis) vs. temp (with stratification). I don't believe much can be discerned from your current figure.

Write more about the odds ratio of 29811, what is that for, what was in the model, etc. I have never seen an OR that high.
My current figure is the raw data. I know what graph you are talking about, but I don't think much can be discerned from it, considering the OR values seem weird. The OR of 29881 is for treatment 1 vs. treatment 2. It's an 'impossible' number. At first glance the graph I provided looks confusing, but this is the raw data summarized without any modeling. You can see the proportion (for both series) 'jumps' up when temperature starts to increase, and there really isn't much difference between the two series (treatments).


So the OR of 29,881 comes from program in the model with the interaction term, correct? Typically you do not interpret lower-order terms when an interaction term is present. So you would ignore the ORs for both program and temperature and instead calculate the ORs for temperature stratified by program.
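
To make the stratified calculation concrete, here is a small Python sketch. The coefficients are hypothetical (chosen only so that the program term is very large) and it assumes program is coded 0/1 and temperature is entered uncentered. Under that coding, exp(b_prog) is the program OR at a temperature of 0, which may lie far outside the observed range, and that is one possible reason a lower-order OR can look absurd.

```python
import math

# Hypothetical coefficients on the log-odds scale, for illustration only.
b_temp = 0.30   # temperature slope in program 0
b_prog = 10.3   # program effect when temperature = 0
b_int = -0.12   # program-by-temperature interaction

def or_temp_by_program(program):
    """OR per one-unit increase in temperature within a program (0 or 1)."""
    return math.exp(b_temp + b_int * program)

def or_program_at_temp(temp):
    """OR for program 1 vs. program 0 at a given temperature."""
    return math.exp(b_prog + b_int * temp)

print("temp OR, program 0:", round(or_temp_by_program(0), 3))
print("temp OR, program 1:", round(or_temp_by_program(1), 3))
print("program OR at temp 0:", round(or_program_at_temp(0)))
print("program OR at temp 85:", round(or_program_at_temp(85), 3))
```

With these made-up numbers the program OR is in the tens of thousands at temperature 0 but close to 1 at a temperature of 85. Centering temperature at its mean before forming the interaction makes the lower-order program term refer to a typical temperature instead.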
But if you have an interaction that is significant because of a difference in magnitude rather than in direction, it is still worth looking at the main effects. I completely understand where you are coming from, but I still think I have to look at the main-effects results just to make sure they make sense, and they definitely do not. Is there any reason I should be concerned about running this with such a small proportion of 'yes' values for the outcome? Also, in my classification table for this model, none of the 100+ subjects who were actually 'yes' were predicted to be 'yes'. That also concerns me, but should it?
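
On the classification table: with an outcome this rare, predicted probabilities seldom reach the usual 0.5 cutoff, so a table built at 0.5 can show zero predicted 'yes' cases even when the model ranks subjects sensibly. The Python sketch below uses made-up outcomes and predicted probabilities to show the effect; none of the numbers come from the poster's data.

```python
def classification_table(y, p, cutoff=0.5):
    """Count true/false positives and negatives at a probability cutoff."""
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, prob in zip(y, p):
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and actual == 1:
            table["tp"] += 1
        elif predicted == 1 and actual == 0:
            table["fp"] += 1
        elif predicted == 0 and actual == 0:
            table["tn"] += 1
        else:
            table["fn"] += 1
    return table

# Hypothetical data: 12 injuries among 1000 subjects (1.2%).
y = [1] * 12 + [0] * 988
# Hypothetical predicted probabilities; all are far below 0.5.
p = [0.04] * 12 + [0.03] * 100 + [0.01] * 888

at_default = classification_table(y, p, cutoff=0.5)
at_lower = classification_table(y, p, cutoff=0.02)
print("cutoff 0.50:", at_default)
print("cutoff 0.02:", at_lower)
```

At 0.5 every subject is classified as 'no'; at a cutoff nearer the event rate the injured subjects are flagged, at the cost of some false positives.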
I know that Tabachnick and Fidell suggest that a binary predictor variable with a very high percentage [not raw number] of cases in one level causes problems [attenuation of the slopes, I believe]. I do not know how this affects logistic regression.

One problem that extremely large parameter estimates can indicate is partial separation of the data. Depending on the software, the model sometimes won't run when that occurs, but other times it will, and huge SEs and parameter estimates are the result.
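
A quick way to screen for this is to cross-tabulate the outcome against the predictors and look for cells where everyone has the same outcome. A rough Python sketch follows; the records, the program labels "A"/"B", and the "cool"/"hot" temperature bands are all hypothetical.

```python
from collections import defaultdict

def cells_without_variation(records):
    """records: iterable of (program, temp_band, injured), injured in {0, 1}.
    Returns the cells in which every subject has the same outcome."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [not injured, injured]
    for program, temp_band, injured in records:
        counts[(program, temp_band)][injured] += 1
    return {cell: tuple(c) for cell, c in counts.items() if 0 in c}

# Hypothetical records: program A has no injuries at all in the cool band.
records = (
    [("A", "cool", 0)] * 50
    + [("A", "hot", 0)] * 45 + [("A", "hot", 1)] * 5
    + [("B", "cool", 0)] * 49 + [("B", "cool", 1)] * 1
    + [("B", "hot", 0)] * 44 + [("B", "hot", 1)] * 6
)

flagged = cells_without_variation(records)
print(flagged)
```

Any flagged cell is a place where the model may push a coefficient toward infinity, which shows up as a huge estimate with a huge standard error.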
It is actually the outcome, not a predictor, that has a very high proportion in one level: 'no injury' is ~99%. So only about 1% of my 10,000 subjects were actually injured. Does this change what you mean?


Offhand, as long as the model is not overparameterized, I can't think of an issue, and you have at least 10-20 events per predictor.

What program are you using?


In my experience, many interactions are not changes in direction but non-parallel slopes. That is still effect modification, and interpreting the crude effect of a term whose effect differs once stratified can be inappropriate (though I understand what you are trying to imply). Do you want to post your output so we can see exactly what you are referencing?

Also, did you test the fit of your model and linearity of the logit?


Side note: one good thing about your small number of events is that, under the rare disease assumption (odds ratios approximate risk ratios), you can also calculate the relative excess risk due to interaction (RERI) to understand the additive interaction.
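
For reference, RERI = OR11 - OR10 - OR01 + 1, where the subscripts mark exposure to the first and second factor. A small Python sketch follows, assuming both factors are coded 0/1 (so temperature would need a chosen contrast, such as a hot vs. a cool band); the function names and the example values are illustrative only.

```python
import math

def reri(or_a_only, or_b_only, or_both):
    """Relative excess risk due to interaction from three odds ratios,
    each relative to the group exposed to neither factor."""
    return or_both - or_a_only - or_b_only + 1.0

def reri_from_logit(b_a, b_b, b_ab):
    """RERI from logistic coefficients for two 0/1 factors A and B
    and their product term."""
    return reri(math.exp(b_a), math.exp(b_b), math.exp(b_a + b_b + b_ab))

# Illustrative values: no product term, yet the joint effect exceeds additivity.
example = reri_from_logit(math.log(2), math.log(3), 0.0)
print(round(example, 3))
```

A RERI of 0 means the joint effect is exactly additive; a positive value means the combined exposure carries more risk than the sum of the separate effects, which can happen even when the multiplicative interaction term is zero.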
I am using SPSS. The results just seem odd to me because the raw data are so easy to see and interpret, and the logistic regression results don't fully follow what I see in the raw data. But that interaction is significant, so maybe we should just use that and forget the main effects. Any suggestions for assessing model fit other than the classification table and the Hosmer-Lemeshow GoF test?
If you have a significant interaction, you should definitely not interpret main effects. You should interpret simple effects: the impact of one predictor at specific levels of the interacting variable.

There are a variety of critics of the Hosmer-Lemeshow test, in part because it is largely atheoretical and can generate incorrect results. It also has low power, I think. But there are not a lot of alternatives to it. There are deviance and chi-square tests, but they are not always available in software and require the data to be in a specific form, I think.
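
For anyone who wants to see what the Hosmer-Lemeshow statistic is doing, here is a bare-bones Python sketch. It sorts subjects by predicted probability, splits them into equal-sized groups, and compares observed with expected event counts; software packages may handle ties and group boundaries differently. The toy data are made up.

```python
def hosmer_lemeshow_statistic(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic.
    y: observed 0/1 outcomes; p: predicted probabilities.
    The result is usually compared to a chi-square with groups - 2 df."""
    pairs = sorted(zip(p, y))  # order subjects by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        observed = sum(outcome for _, outcome in chunk)
        expected = sum(prob for prob, _ in chunk)
        mean_p = expected / n_g
        # Assumes 0 < mean_p < 1 in every group, else this divides by zero.
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    return stat

# Toy data: two risk groups with predicted probabilities 0.2 and 0.8.
p = [0.2] * 10 + [0.8] * 10
well_calibrated = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
poorly_calibrated = [1] * 10 + [1] * 8 + [0] * 2

good_fit = hosmer_lemeshow_statistic(well_calibrated, p, groups=2)
bad_fit = hosmer_lemeshow_statistic(poorly_calibrated, p, groups=2)
print(good_fit, bad_fit)
```

The statistic depends on how the groups are formed, which is one of the usual criticisms: a different number of groups can change the conclusion.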