linear probability models

noetsi

Fortran must die
We have dependent variables that have two levels. I think that should be logistic regression, but the federal agency we work for says the model should be linear (OLS I believe).

So what problems do I need to look out for, and what diagnostics should I use? Since the data will automatically be heteroscedastic, I am not sure there is any point in testing for that.

noetsi

Fortran must die
"If the probability is between .20 and .80, then the log odds are almost a linear function of the probability."

How do you know if this occurs? That is, how can you test whether the probabilities lie between .2 and .8?
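To see what the quoted rule of thumb is claiming, here is a rough sketch that compares the log odds with a straight line over two ranges of p. The straight line used is the tangent at p = 0.5, which is my own choice for illustration (a line fitted over the range would track a little closer), and the function names are made up.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def linear_approx(p):
    """Tangent line to logit(p) at p = 0.5 (slope 4, value 0 at p = 0.5)."""
    return 4 * (p - 0.5)

def max_gap(lo, hi, steps=1000):
    """Largest absolute gap between logit and the tangent line on [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(abs(logit(p) - linear_approx(p)) for p in grid)

inside = max_gap(0.20, 0.80)
outside = max_gap(0.01, 0.99)
print(f"max gap on [0.20, 0.80]: {inside:.3f}")
print(f"max gap on [0.01, 0.99]: {outside:.3f}")
```

The gap stays small between .2 and .8 and grows quickly toward 0 and 1, which is the sense in which the log odds are "almost linear" in the middle range.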

The residuals will be heteroskedastic and non-normal. You can use White standard errors for the former. But given that the latter is true, can you use t-tests for the slopes at all?
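On the White SE point, here is a minimal sketch of what the correction does for a one-predictor linear probability model. Everything in it is an assumption for illustration: the data are simulated, the names are made up, and the HC0 formula is written out by hand in plain Python rather than taken from a stats package.

```python
import math
import random

random.seed(0)

# Simulated data (hypothetical): true P(y = 1 | x) = 0.2 + 0.6 * x, x in [0, 1]
n = 2000
x = [random.random() for _ in range(n)]
y = [1 if random.random() < 0.2 + 0.6 * xi else 0 for xi in x]

# OLS fit of y = b0 + b1 * x
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Classical slope SE: assumes one constant error variance
sigma2 = sum(e ** 2 for e in resid) / (n - 2)
se_classical = math.sqrt(sigma2 / sxx)

# White (HC0) slope SE: each squared residual is weighted by (x - x_bar)^2
meat = sum(((xi - x_bar) ** 2) * e ** 2 for xi, e in zip(x, resid))
se_white = math.sqrt(meat / sxx ** 2)

print(f"slope = {b1:.3f}")
print(f"classical SE = {se_classical:.4f}, White SE = {se_white:.4f}")
```

With p running from .2 to .8 the two standard errors should come out close; the gap tends to grow as fitted probabilities approach 0 or 1, since the error variance is p(1 - p). As for the t-tests, with thousands of cases the usual argument is that the slope estimates are approximately normal in large samples even though the residuals are not.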

hlsmith

Less is more. Stay pure. Stay poor.
Yeah, the biggest thing to watch is whether the Ys and especially the Y-hats sit too close to the bounds (0 and 1); otherwise you can end up with estimates outside the range of possible values. This is also very relevant to confidence interval estimates.

Otherwise, I think you treat everything else as you typically do in OLS.

noetsi

Fortran must die
"Yeah, the biggest thing to watch is whether the Ys and especially the Y-hats sit too close to the bounds (0 and 1); otherwise you can end up with estimates outside the range of possible values. This is also very relevant to confidence interval estimates.

Otherwise, I think you treat everything else as you typically do in OLS."

How do you test whether the Y-hats are near the bounds? Or whether the probabilities are in the .2 to .8 range?

Dason

Well the y_hat values are just the predictions. Maybe you could look at the predictions and see if any are in that range...
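A rough sketch of that check, assuming simulated data and a one-predictor model fitted by hand (all names here are hypothetical): fit the linear model, take the fitted values, and count how many land outside [0, 1] and outside [.2, .8].

```python
import random

random.seed(1)

# Simulated data (hypothetical): true P(y = 1 | x) = 0.05 + 0.9 * x, x in [0, 1]
n = 1000
x = [random.random() for _ in range(n)]
y = [1 if random.random() < 0.05 + 0.9 * xi else 0 for xi in x]

# OLS fit of y = b0 + b1 * x
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Fitted values: in a linear probability model these are predicted probabilities
y_hat = [b0 + b1 * xi for xi in x]

def share_outside(values, lo, hi):
    """Fraction of values falling outside the closed interval [lo, hi]."""
    return sum(1 for v in values if v < lo or v > hi) / len(values)

outside_unit = share_outside(y_hat, 0.0, 1.0)
outside_mid = share_outside(y_hat, 0.2, 0.8)

print(f"min y_hat = {min(y_hat):.3f}, max y_hat = {max(y_hat):.3f}")
print(f"share outside [0, 1]:     {outside_unit:.3f}")
print(f"share outside [0.2, 0.8]: {outside_mid:.3f}")
```

This is a description of the predictions rather than a formal test; a large share outside the middle range is a hint that the linear fit and a logistic fit may start to disagree.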

hlsmith

Less is more. Stay pure. Stay poor.
Also, sample size will influence the precision of the y-hats. So if someone had a small sample, a value of, say, 0.79 could have a confidence interval that extends beyond 1.0. And, as you know, multicollinearity can also inflate the SE values.
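As a simplified illustration of the 0.79 example, the sketch below treats 0.79 as a plain sample proportion and uses the textbook Wald interval. That is an assumption made to keep the arithmetic short; the interval for a regression prediction has a different standard error, but the dependence on sample size is the same idea.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% Wald interval for a proportion p_hat from n observations."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

for n in (10, 100, 1000):
    lower, upper = wald_interval(0.79, n)
    print(f"n = {n:4d}: ({lower:.3f}, {upper:.3f})")
```

With n = 10 the upper limit lands above 1.0, while with n = 1000 the interval stays well inside the unit range.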

noetsi

Fortran must die
"Well the y_hat values are just the predictions. Maybe you could look at the predictions and see if any are in that range..."

Do you mean the predictions are outside the .2 to .8 range? I did not think the predictions were percentages and I think that the .2 to .8 reference is to percentages.

I will have hundreds, probably thousands of cases so the sample size will not be small.

hlsmith

Less is more. Stay pure. Stay poor.
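Well, what is the outcome format? Probability, which ain't too far removed from percentages. With a 0/1 outcome, the y-hat from the linear model is the predicted probability that Y = 1, so predicted probabilities near zero and one are the troublesome ones.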