Logistic regression probabilities


New Member

I have built a logistic regression model that predicts the response probability for a marketing campaign. The model ranks correctly (i.e. lower-scored customers take up the offer less often than higher-scored customers), but the probabilities predicted by the model consistently differ from the actual take-up percentage (e.g. the model scores a customer at 0.25, implying that 1 in every 4 customers with that score should take up the offer, but in reality only 1 in every 10 customers with the same score does).

What could the possible reasons for this be? I have checked for multicollinearity, and that is not the issue, and the sample size was sufficient (about 10000 obs for a model with 15 predictors). My gut tells me that it might have something to do with the link function (logit was used), but I don't really understand link functions, so if you think it could be something like that, please try to explain.
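The mismatch described above can be quantified directly by binning customers on their scores and comparing the mean predicted probability to the observed take-up rate in each bin. A minimal sketch, assuming the scores and outcomes are NumPy arrays (the names `p_pred` and `y` and the toy data are my own, purely for illustration):

```python
import numpy as np

def calibration_table(p_pred, y, n_bins=10):
    """Compare mean predicted probability to observed take-up rate per score bin."""
    order = np.argsort(p_pred)
    rows = []
    for idx in np.array_split(order, n_bins):   # equal-count bins of sorted scores
        rows.append((p_pred[idx].mean(), y[idx].mean(), len(idx)))
    return rows

# toy data: a model that ranks correctly but inflates every probability
rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.15, 10_000)
y = rng.binomial(1, p_true)
p_pred = np.minimum(p_true * 2.5, 1.0)          # correct ranking, inflated scores

for pred, obs, n in calibration_table(p_pred, y, 5):
    print(f"predicted {pred:.3f}  observed {obs:.3f}  (n={n})")
```

If predicted and observed rates diverge by a roughly constant factor in every bin, as in this toy example, the model is discriminating well but is poorly calibrated, which matches the symptom in the question.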



No cake for spunky
There are lots of possible reasons, including violations of various regression assumptions and (most likely) misspecification: you either left variables out of the model or you specified the wrong functional form. Multicollinearity won't bias the slopes; it only inflates the standard errors and thus affects the statistical tests. Is it possible that a small sample size is changing the estimated effect size?

I can't imagine why it would be the link function. Different link functions commonly generate similar results. For instance, you can multiply probit coefficients by a constant (roughly 1.6) and get approximately the logit results.
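The probit/logit similarity mentioned above can be checked numerically: the logistic CDF evaluated at about 1.7 times its argument tracks the standard normal CDF closely, which is why rescaled coefficients from the two models look alike. A quick sketch:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit   # the logistic CDF

x = np.linspace(-3, 3, 601)
# expit(1.702 * x) approximates norm.cdf(x); the worst-case gap is under 0.01
max_gap = np.abs(expit(1.702 * x) - norm.cdf(x)).max()
print(f"max |logistic(1.702 x) - Phi(x)| = {max_gap:.4f}")
```

Because the two link functions are this close, swapping logit for probit cannot explain a predicted rate of 0.25 against an observed rate of 0.10.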


Less is more. Stay pure. Stay poor.
Could there also be an omitted interaction term? Perhaps the effect is moderated by an interaction that you have not modelled.
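An omitted interaction is easy to test for once you suspect one. A sketch using a statsmodels formula, where the predictor names (`age`, `recency`) and the simulated data are hypothetical stand-ins for the campaign variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({"age": rng.normal(0, 1, n), "recency": rng.normal(0, 1, n)})

# simulate a true model that contains an age x recency interaction
logit_p = -2 + 0.5 * df.age + 0.5 * df.recency + 0.8 * df.age * df.recency
df["takeup"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# 'age * recency' expands to both main effects plus the interaction term
fit = smf.logit("takeup ~ age * recency", data=df).fit(disp=0)
print(fit.params)
```

Comparing this fit against the main-effects-only model (via a likelihood-ratio test or AIC) tells you whether the interaction is doing real work.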


New Member
Thank you for your response!

I suppose I could look at the functional form, although I doubt that would be the problem since most of the variables are discrete. We have tried quite a few variables (about 400), but I could maybe try some additional variables. I do appreciate that you ruled out the link function :)


No cake for spunky
One thing you can try, although it won't explain why you have a problem, is a goodness-of-fit test such as Hosmer-Lemeshow, which compares the observed and predicted event rates across groups of your data. At least this will tell you to some extent whether alternative models are improving the fit.
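The Hosmer-Lemeshow test is simple enough to sketch by hand: group observations by predicted score, then compare observed and expected event counts with a chi-square statistic. A minimal version with toy data (my own, for illustration):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(p_pred, y, g=10):
    """Hosmer-Lemeshow chi-square over g equal-count score groups.

    A small p-value indicates the predicted probabilities do not match
    the observed event rates (poor calibration)."""
    order = np.argsort(p_pred)
    stat = 0.0
    for idx in np.array_split(order, g):
        n = len(idx)
        expected = p_pred[idx].sum()      # expected events in this group
        observed = y[idx].sum()           # observed events in this group
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return stat, chi2.sf(stat, g - 2)     # g - 2 df is the usual convention

rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.15, 10_000)
y = rng.binomial(1, p_true)

stat_good, p_good = hosmer_lemeshow(p_true, y)                        # calibrated
stat_bad, p_bad = hosmer_lemeshow(np.minimum(p_true * 2.5, 1.0), y)   # inflated
print(f"calibrated:  chi2={stat_good:.1f}  p={p_good:.3f}")
print(f"inflated:    chi2={stat_bad:.1f}  p={p_bad:.2e}")
```

For scores as far off as the 0.25-vs-0.10 example in the question, the statistic should be enormous, confirming a calibration problem rather than a ranking problem.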

Another issue that was not mentioned is that an extreme outlier or outliers may be distorting the results and/or moving the fitted curve. Looking for standardized residuals beyond 2 (or possibly 3), or at DFBETA (which shows how far the regression coefficients move when you remove a point), is a simple way to see if this is an issue.