Zero inflation in binomial logistic regression

#1
Hi!
I am investigating the prevalence of co-infection in a sample of ticks. My dataset contains an excessive number of zeros (269/286), and the result of my binomial logistic regression model is non-significant. I would be grateful if anyone could tell me why this excessive number of zeros is a problem. What happens when you have too many zeros? Why is the model no longer a good choice? In the end, I assume it means that the risk of infection can be underestimated and that I should instead use zero-inflated negative binomial regression. I just would like to really understand what happens with the binomial logistic model when you have too many zeros before I move on.

I would really appreciate your help!

Kind regards/ Hanna
 

Karabiner

TS Contributor
#2
I am investigating the prevalence of co-infection in a sample of ticks. My dataset contains an excessive number of zeros (269/286), and the result of my binomial logistic regression model is non-significant.
The simplest possible model would predict "zero" for every case. That model would be correct in 269/286 = 94% of cases.
It is difficult to improve on that.

Essentially, you are trying to predict 17 tick cases out of 286 total cases. How many predictors do you have? If there are
more than 1 or 2, you will be left with a poor ratio between the number of cases to be predicted and the number of predictors.
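Just to make the arithmetic concrete, a minimal sketch in Python using only the counts from your post (the predictor counts in the loop are purely illustrative):

n_total, n_events = 286, 17

# An "always predict no co-infection" model gets every non-event right.
baseline_accuracy = (n_total - n_events) / n_total
print(f"all-zero baseline accuracy: {baseline_accuracy:.1%}")  # roughly 94%

# With only 17 events, even a few predictors leave little information per coefficient.
for n_predictors in (1, 2, 5):  # purely illustrative predictor counts
    print(n_predictors, "predictor(s) ->", round(n_events / n_predictors, 1), "events per predictor")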

With kind regards

Karabiner
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
Not sure you need to move on. If the outcome is truly binary (co-infection: y/n), your issue is just a rare outcome. If you are thinking about using a zero-inflated binomial or Poisson model, well, Poisson is for count data and you have a binary variable; I'm not sure about the binomial version - you can post a link.

With rare outcomes, issues can be related to model convergence, since there can be complete separation. This happens when, given the finite sample size, it looks as though a covariate perfectly predicts the outcome class. Another issue with rare outcomes is that even when the model converges, it generates large standard errors; that is just a by-product of the uncertainty given the small subgroup.

Fixes to get models to converge, if that is your issue, include using Firth's correction or exact procedures. Another option is a Bayesian model with priors. However, you state that your model is non-significant, so it must be converging. So what is your issue with non-significance? Maybe that is just what it is, and you don't need to try and P-HACK your way to significance, right? Why is the result wrong just because it is non-significant? There could be a small signal there, but you just don't have enough data to rule out chance given the rare outcome.
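To make the separation point concrete, here is a minimal sketch in Python with made-up group sizes (150 attached / 136 free; only the 17 events out of 286 ticks come from your posts). When every event sits in one level of a binary predictor, the fit will typically warn about convergence or separation and report an enormous coefficient and standard error for that term:

import numpy as np
import statsmodels.api as sm

# Hypothetical counts: 150 attached ticks (17 co-infected), 136 free ticks (0 co-infected).
attached   = np.r_[np.ones(150), np.zeros(136)]
coinfected = np.r_[np.ones(17), np.zeros(133), np.zeros(136)]

X = sm.add_constant(attached)          # intercept + group indicator
fit = sm.Logit(coinfected, X).fit()    # expect convergence/separation warnings here
print(fit.summary())                   # the group coefficient and its SE blow up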

Welcome to the forum!
 
#4
Thank you both Karabiner and hlsmith for your answers!
In this analysis, I only use one independent variable, whether the ticks are attached to a host or free in nature. And my outcome is co-infection 0/1.
The reason why the non-significant result feels wrong is that all of my co-infections occurred in the host-attached group, so it "feels" like host attachment somehow would affect the rate of co-infection. My other problem is that my result is somewhat strange. Instead of a "normal" odds ratio, I get one of 8.3E+07 with a very large standard error of >1700, whereas in my other analyses the SE is <1. But this might be explained by what hlsmith said: "when the model converges it generates large standard errors, and that is just a by-product of uncertainty given the small subgroup."
After the analysis, I did a hypothesis test, a Type II ANOVA, that was significant. I also made an effect plot, which was just a straight line. I'm not 100% sure that all these results are because of the excess of zeros; I just thought that was the most likely explanation. When this happens, is there a risk that the analysis underestimates the risk of infection? And when I try to explain to someone why I might not think this model is the optimal choice, what is the best way to put it?

I really appreciate your help. I'm new to statistics, and it is like a whole new world to me.

Kind regards
Hanna
 

katxt

Active Member
#5
So your data is two columns of 0/1. You have a great stack of (0,0), many (1,0) and some (1,1). If I have this right, it sounds like a chi-square situation.
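A minimal sketch of that, assuming hypothetical group sizes (only the 17 vs 0 split of co-infections comes from the thread):

from scipy.stats import chi2_contingency, fisher_exact

# Rows: attached / free; columns: co-infected / not co-infected.
# Hypothetical group sizes; only the 17 vs 0 split comes from the thread.
table = [[17, 133],
         [0, 136]]

chi2, p_chi2, dof, expected = chi2_contingency(table)  # chi-square with continuity correction
odds_ratio, p_fisher = fisher_exact(table)             # exact test, safer with a zero cell
print(p_chi2, p_fisher)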
 

katxt

Active Member
#6
The reason why the non-significant result feels wrong is that all of my co-infections occurred in the host-attached group, so it "feels" like host attachment somehow would affect the rate of co-infection.
Or, if all you want is proof of your "feeling", find the probability of the 17 positives all being in the one group if they are randomly distributed across the whole lot.
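That is a straightforward hypergeometric calculation; a quick sketch, assuming a made-up attached-group size of 150:

from scipy.stats import hypergeom

n_total, n_positive = 286, 17
n_attached = 150  # hypothetical; plug in your actual attached-group size

# Probability that all 17 positives land in the attached group
# if they were scattered at random over all 286 ticks.
p_all_in_attached = hypergeom.pmf(n_positive, n_total, n_positive, n_attached)
print(p_all_in_attached)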
 

Karabiner

TS Contributor
#7
In this analysis, I only use one independent variable,
I have to underscore what @katxt said - why do you perform logistic regression here?
It is like using a cannon to shoot at sparrows. A 2x2 table with a Chi² statistic is sufficient.
The reason why the non-significant result feels wrong is that all of my co-infections occurred in the host-attached group, so it "feels" like host attachment somehow would affect the rate of co-infection.
The analysis is meant to replace feelings with calculations. You only have 17 cases in the smaller group,
and the analysis tells you that the results are not strong enough to reject the null hypothesis.
This does not necessarily mean that your impression is wrong, but the empirical evidence here
is not yet strong enough.
After the analysis, I did a hypothesis test, a Type II ANOVA, that was significant.
Which data did you use for that, and how is it connected to the other analysis?

With kind regards

Karabiner
 

katxt

Active Member
#8
If you want a predictive model, you can make one along the lines of "If it is attached, the probability of co-infection is between ... and ...; otherwise the probability of co-infection is between ... and ...". For 0 cases there is still a small probability range. kat
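For example, with exact (Clopper-Pearson) intervals - a quick sketch, again with made-up group sizes (only the 17 and 0 counts come from the thread):

from statsmodels.stats.proportion import proportion_confint

# Hypothetical group sizes; 17 and 0 co-infections as reported in the thread.
attached_ci = proportion_confint(17, 150, alpha=0.05, method="beta")  # Clopper-Pearson interval
free_ci     = proportion_confint(0, 136, alpha=0.05, method="beta")   # lower bound 0, upper bound small but > 0
print(attached_ci, free_ci)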