I have been trying to develop a logistic regression model that predicts future enrollment in part 2 of a required program sequence, based on characteristics displayed in part 1 of the sequence. Some people have missing data points, so those observations are dropped (listwise deletion) when the model is built. The model is fit on about 1,500 complete observations, with about 300 excluded.
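For concreteness, here is a minimal sketch of that setup, with a hypothetical predictor column (`x1`) and toy data standing in for my real file:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real part-1 data; 'x1' is a hypothetical predictor.
df = pd.DataFrame({
    "x1": [0.2, 1.5, np.nan, -0.3, 0.9, np.nan, 2.1, -1.0, 0.4, 1.1],
    "enrolled_part2": [1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})

# Listwise deletion: rows with any missing predictor are excluded,
# just as in my model-building step.
complete = df.dropna()
model = LogisticRegression().fit(complete[["x1"]], complete["enrolled_part2"])

print(len(df) - len(complete))  # number of excluded observations
```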

When I test the model against past data, I find that it consistently over-estimates part 2 enrollment by about 8-12 percentage points, in every one of the past 10 years of data. This strikes me as unusual. If the model fit were simply poor, shouldn't I (likely) see some years under-estimated rather than all of them over-estimated, even though I only have n = 10 years of data to compare to?
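This is roughly how I am computing the yearly gap (a sketch with made-up records; each record is a (year, predicted probability, actual enrollment) tuple):

```python
from collections import defaultdict

def yearly_calibration_gap(records):
    """Mean predicted enrollment probability minus observed enrollment
    rate, per year. Positive values mean over-estimation."""
    by_year = defaultdict(lambda: ([], []))
    for year, pred, enrolled in records:
        by_year[year][0].append(pred)
        by_year[year][1].append(enrolled)
    return {
        year: sum(preds) / len(preds) - sum(obs) / len(obs)
        for year, (preds, obs) in by_year.items()
    }

# Hypothetical records: the model predicts ~0.7-0.8 on average,
# but only half of these students actually enrolled.
records = [(2014, 0.8, 1), (2014, 0.6, 0), (2015, 0.9, 1), (2015, 0.7, 0)]
print(yearly_calibration_gap(records))
```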

Looking at my descriptives, I see that the proportion of program drop-outs in the sample the model is built from is about 23%, while among the observations excluded from model building (due to missing data points) it is about 27%. My first instinct is that, because these higher-risk persons are left out of the parameter estimates, the model ends up "over-confident" about enrollment predictions.
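To gauge whether that 23% vs. 27% gap could plausibly be chance, I sketched a two-proportion z-test (counts reconstructed from the rounded percentages, so they are approximate):

```python
import math

def two_prop_ztest(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# ~23% drop-outs among 1500 modeled vs ~27% among 300 excluded
z, p = two_prop_ztest(345, 1500, 81, 300)
print(round(z, 2), round(p, 3))
```

On these reconstructed counts the difference does not look clearly beyond sampling noise, which is part of why I am unsure this is the whole explanation.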

Is this what is most likely happening? Are there other possibilities that I should explore? Thanks for any help that you can offer!

techsassy