Bias in Zero-One Inflated Beta Regression?

We have recently been using the zero-one inflated beta regression macros that were published as part of SAS Global Forum 2012 (see...). They have been very helpful in modelling the Loss Given Default (LGD) data that we encounter more accurately.

I am puzzled by something we have observed, and was hoping someone might have some suggestions as to how to proceed.

In our dataset, the observed proportions of the zero and one outcomes are ~55% and ~22% respectively. After modelling, the average model-predicted probabilities are ~71% and ~48% respectively. I am surprised that (i) the predicted probabilities appear to be biased (i.e., they do not reproduce the observed proportions in the dataset), and (ii) they sum to >100%, which should be impossible.

We want to use the model to make forecasts, but given the 'raw' predictions this would obviously result in overestimation. I am thinking of making some simple adjustments (i.e., applying ratios of 55%/71% and 22%/48% to model predictions), but I am wondering whether someone might have some other suggestions that we should consider.

Appreciate any thoughts you might have.


Less is more. Stay pure. Stay poor.
Is it kicking out an estimate for '0' and '1' directly, or are you calculating those from the coefficients? For example, I can fit a logistic regression that gives a predicted probability of death of 34% in diabetics and 80% in hypertensives. They sum to > 1, but that is fine: they are predictions for different covariate groups (both entered into the model as IVs), not probabilities of mutually exclusive outcomes.

Could anything like this be going on?
The model is set up as a mixture density in which each of the four parameters (pi0, pi1, mu, theta) has its own parameter estimates and covariate design matrix (see... for full details).
The parameter estimates are obtained via ML and then used to get predicted values for each observation. So, using the covariate values for each observation, I get model-fitted predictions of pi0, pi1, mu, and theta for that observation. Given the mixture design and the 'simultaneous' fit of the four parameters of the model, I would think it should be algebraically impossible to get pi0 + pi1 > 1; yet that is exactly what we're getting for many observations.
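
Whether pi0 + pi1 > 1 is algebraically impossible depends on how the two probabilities are linked to their linear predictors. The sketch below is only an illustration, not the macro's actual code: the function names and the linear predictor values `eta0` and `eta1` are made up. It compares two separate logit links, which leave the sum unconstrained, with a multinomial logit that uses the continuous (0,1) component as the reference category and forces the sum below 1.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def separate_logits(eta0, eta1):
    """Each probability has its own logit link; nothing ties the two together."""
    return inv_logit(eta0), inv_logit(eta1)


def multinomial_logit(eta0, eta1):
    """Shared denominator, with the continuous (0,1) part as reference category."""
    denom = 1.0 + math.exp(eta0) + math.exp(eta1)
    return math.exp(eta0) / denom, math.exp(eta1) / denom


# Hypothetical linear predictors for a single observation (assumed values).
eta0, eta1 = 0.9, -0.1

p0_sep, p1_sep = separate_logits(eta0, eta1)
p0_mn, p1_mn = multinomial_logit(eta0, eta1)

print("separate logits   :", round(p0_sep, 3), round(p1_sep, 3), "sum =", round(p0_sep + p1_sep, 3))
print("multinomial logit :", round(p0_mn, 3), round(p1_mn, 3), "sum =", round(p0_mn + p1_mn, 3))
```

If the fitted pi0 and pi1 come from something like the first form, sums above 1 are possible for individual observations, and the documentation for the macros would need to be checked to see which probabilities those two parameters actually represent.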


Ambassador to the humans
You say you're using covariates? Do the predicted probabilities sum to less than one for each observation? It sounds like the distribution of the covariates might be playing a role here, since you're talking about averaging over all the observations.
Yes - there are covariates being used to model each of the four parameters in the density. No - we have many observations where the sum of the predicted probabilities (pi0_hat + pi1_hat) is > 1, which seems like a fundamental 'failure' of the model.
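
One quick arithmetic check on the figures quoted in the thread, under an assumption that is not confirmed anywhere above: that pi1 is a conditional probability, P(y = 1 | y != 0), rather than the marginal P(y = 1). In that case pi0 and pi1 would not need to sum to less than 1, and the observed proportions would imply a conditional value close to the ~48% average prediction. The helper names are hypothetical.

```python
# Observed proportions quoted in the thread.
obs_p0 = 0.55  # share of observations equal to 0
obs_p1 = 0.22  # share of observations equal to 1


def conditional_one_prob(p0, p1):
    """P(y = 1 | y != 0) implied by the marginal P(y = 0) and P(y = 1)."""
    return p1 / (1.0 - p0)


def marginal_one_prob(pi0, pi1_cond):
    """Marginal P(y = 1) if pi1 is conditional on y != 0 (an assumption)."""
    return (1.0 - pi0) * pi1_cond


implied = conditional_one_prob(obs_p0, obs_p1)
print(round(implied, 3))                             # 0.489
print(round(marginal_one_prob(obs_p0, implied), 3))  # 0.22
```

This would account for the ~48% figure but not for the gap between the ~71% average predicted pi0 and the observed ~55%, so it is at most part of the picture.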