# Linear Probability Models

#### noetsi

##### Fortran must die
There is a dispute over whether OLS is valid when the dependent variable takes on two levels [in which case one can use a linear probability model] or whether one must use logistic regression. This seems central to the dispute [and although violations of normality and heteroscedasticity are inherent in the LPM, I have whole populations, so I am not as concerned about that].

“These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.”

My question is: how do you know whether most probabilities are between those values? Ideally I would like the answer in terms of SAS, but I will take what I can get.

#### Dason

Fit a model and check the predicted values
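To make this suggestion concrete: once you have the predicted probabilities from a fitted model, you can check what share falls in the moderate [0.2, 0.8] band from the rule of thumb quoted above. The poster asked about SAS; this is a minimal language-agnostic sketch in Python, and the predicted values are hypothetical.

```python
# Hypothetical predicted probabilities from a fitted model.
preds = [0.05, 0.15, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95, 0.33, 0.62]

# Share of predictions in the "moderate" band where LPM and logistic
# regression tend to agree, per the rule of thumb.
moderate = [p for p in preds if 0.2 <= p <= 0.8]
share = len(moderate) / len(preds)
print(f"{share:.0%} of predictions fall in [0.2, 0.8]")
```

In SAS the same check could be done on the predicted-probability variable written out by the model's output data set.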

#### noetsi

##### Fortran must die
Yeah, I think I have, but I am checking to be sure the SAS output file is doing what I think it is.

#### noetsi

##### Fortran must die
Being new to this, I am confused. For each case there is a probability of it taking on a value of 1 and of 0 (the DV has two levels). When they say that the probabilities are largely between .2 and .8, are they talking about the probability of 1, of 0, or some combination? About two-thirds of the DV has a value of 0.

#### Dason

Either or. It literally doesn't matter. Choose one. The answer is the same either way.
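This point follows from P(Y=1) = 1 − P(Y=0): a probability p lies in [0.2, 0.8] exactly when 1 − p does, so the rule of thumb gives the same verdict whichever level you model. A tiny Python check with illustrative values:

```python
# If P(Y=1) = p for a case, then P(Y=0) = 1 - p, and p falls in the
# moderate band [0.2, 0.8] exactly when 1 - p does.
for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    assert (0.2 <= p <= 0.8) == (0.2 <= 1 - p <= 0.8)
print("same answer either way")
```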

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Make a histogram of them to better visualize the distribution!
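For instance, binning the predicted probabilities in tenths shows at a glance how many cases sit near 0 or 1. A quick text-histogram sketch in Python (the values are hypothetical; in SAS a histogram of the predicted-probability variable would serve the same purpose):

```python
# Hypothetical predicted probabilities.
preds = [0.03, 0.07, 0.12, 0.18, 0.22, 0.45, 0.51, 0.77, 0.88, 0.96]

# Count how many predictions fall into each tenth-wide bin.
bins = [0] * 10
for p in preds:
    bins[min(int(p * 10), 9)] += 1  # clamp p == 1.0 into the last bin

# Print a crude text histogram.
for i, n in enumerate(bins):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} | {'#' * n}")
```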

#### noetsi

##### Fortran must die
I did. It shows a badly distorted distribution.

#### Dason

You don't need more than your quartiles to tell you that the LPM probably isn't adequate.

#### noetsi

##### Fortran must die
Thanks Dason. The issue will be that the federal government has chosen this model based on economic advice and has used it for 20 years. I have to show them very formally that it's wrong, if it is.

#### Dason

How many predictors do you have in your model?

#### Nonlinear_Zero-Sum

##### Member
Probability and odds have a nonlinear relationship.

Distortion (error, misunderstanding) is created when you force a linear relationship between odds and their implied probabilities, especially at the extremes, which could be a significant factor in the "longshot bias". Note: the absolute error (the difference between the linear and nonlinear conversions) appears to shrink as the underdog's odds increase, but the relative error goes through the roof.
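The nonlinearity is easy to see from the basic conversion p = odds / (1 + odds): equal steps in odds produce shrinking steps in probability. A minimal Python illustration (the odds values are arbitrary):

```python
# Convert odds to probability: p = odds / (1 + odds).
# The mapping is nonlinear, so any linear approximation between odds
# and probability distorts the extremes.
def odds_to_prob(odds):
    return odds / (1 + odds)

for odds in [0.25, 1.0, 4.0, 19.0]:
    print(f"odds {odds:>5} -> p = {odds_to_prob(odds):.3f}")
```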


#### noetsi

##### Fortran must die
One thing I find puzzling. I ran the data set addressed above (in part): about 30 predictors and a dependent variable with two levels, 0 and 1. I generated odds ratios and then ranked the predictors (most, but not all, of which were dummy variables) from largest to smallest odds ratio. For odds ratios below 1 I took the reciprocal before ranking; otherwise an odds ratio of 1 would appear to have a greater impact than one of .1, which is obviously incorrect.
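The reciprocal-ranking step just described can be sketched in a few lines of Python (the predictor names and odds-ratio values here are made up):

```python
# Hypothetical odds ratios for four predictors.
odds_ratios = {"x1": 0.1, "x2": 2.0, "x3": 1.5, "x4": 0.5}

# For ORs below 1, take the reciprocal so that, e.g., 0.1 outranks 1.5:
# an OR of 0.1 is as strong an effect as an OR of 10.
magnitude = {k: max(v, 1 / v) for k, v in odds_ratios.items()}

# Rank predictors from strongest to weakest effect.
ranked = sorted(magnitude, key=magnitude.get, reverse=True)
print(ranked)
```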

Separately, I ran the same variables through linear regression (a linear probability model) and ranked them by the absolute value of their slopes. I expected the relative rankings to be pretty close between the two approaches, but in fact there are some significant differences.

My guess is that this is caused by the linear model using OLS while the logistic regression uses maximum likelihood. I cannot think of any other reason the rankings would be different.