# Linear Probability Model

#### noetsi

##### Fortran must die
I have a question about the following statement:

For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability (cf. Long 1997).

What do they mean about the probability being between .2 and .8 here? That the probability of Y being 1 is between .2 and .8?

https://statisticalhorizons.com/linear-vs-logistic

#### Dason

Sure. I mean it's a bit more general than that but there's nothing wrong with thinking about it the way you are.

#### hlsmith

##### Not a robit
I believe it's the whole sigmoidal thing. Between those values there is typically a roughly linear slope, but you get the curves and the respective asymptotes at the tails.
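To put a rough number on "almost linear", here is a minimal sketch that fits a straight line to the log odds over two probability ranges and compares the R² values. The grid size and the wider comparison range (.01 to .99) are arbitrary choices for illustration and do not come from the article.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def linear_fit_r2(xs, ys):
    """Least-squares line through (xs, ys); returns (intercept, slope, R^2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)
    return intercept, slope, r2


def grid(lo, hi, n=61):
    """Evenly spaced probabilities from lo to hi (n is an arbitrary choice)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


# Range from the quoted statement versus a much wider, illustrative range.
mid = grid(0.20, 0.80)
wide = grid(0.01, 0.99)

_, slope_mid, r2_mid = linear_fit_r2(mid, [logit(p) for p in mid])
_, slope_wide, r2_wide = linear_fit_r2(wide, [logit(p) for p in wide])

print(f"p in [0.20, 0.80]: slope = {slope_mid:.2f}, R^2 = {r2_mid:.4f}")
print(f"p in [0.01, 0.99]: slope = {slope_wide:.2f}, R^2 = {r2_wide:.4f}")
```

The R² for the middle range should come out very close to 1, and noticeably lower once the tails are included, which is the curvature described above.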

#### noetsi

##### Fortran must die
Thanks. As a follow-up, how do you know whether your data reasonably have a probability between .2 and .8? Is there a test to run, or do you generate some curve and look at it?

All my training/education taught me that when you have a binary dependent variable you use something like logistic regression, not OLS. But the organization I work for has been told by the federal government to use a linear probability model for this (which apparently is the norm in economics), so I will.

#### Dason

Well "linear probability model" doesn't necessarily imply OLS. You can use maximum likelihood or some weighted least squares approach to do the actual estimation.

#### noetsi

##### Fortran must die
Here is a related question. The following author states that the problems inherent in linear models with a binary DV do not apply when the predictor is a binary variable (or at least I think that is what they are saying). Is this true, and is it still true if there are dummy and interval predictors in the same regression?

The main reason that the LPM works so well to estimate experimental impacts is that treatment status is a binary variable (not a continuous variable, which would be subject to the potential bias described above). This means that the functional form concerns about LPM do not apply to estimating impacts, since all that is required is to estimate two prevalence rates, one for the treatment group and one for the control group (as opposed to estimating a different prevalence rate for every unique value of a continuous variable).
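The "two prevalence rates" point in the quote can be checked directly. The sketch below uses made-up data (10 control and 10 treated cases, purely hypothetical) and a hand-written simple OLS fit; with a single binary predictor the intercept should equal the control-group rate and the slope the difference between the two rates.

```python
def ols_simple(x, y):
    """Simple least-squares regression of y on x; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: treatment indicator and binary outcome.
treat = [0] * 10 + [1] * 10
outcome = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0] + [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Prevalence rates computed directly from each group.
control_rate = sum(y for t, y in zip(treat, outcome) if t == 0) / treat.count(0)
treated_rate = sum(y for t, y in zip(treat, outcome) if t == 1) / treat.count(1)

intercept, slope = ols_simple(treat, outcome)

print(f"control rate = {control_rate:.2f}, treated rate = {treated_rate:.2f}")
print(f"LPM intercept = {intercept:.2f}, LPM slope = {slope:.2f}")
```

This only covers the single-binary-predictor case. Once other covariates enter the regression, the fitted line is no longer pinned to two group means, so the same argument does not carry over automatically.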

#### Dason

Are there other variables or is it a single binary predictor?

#### noetsi

##### Fortran must die
I am confused about what the difference between marginal effects (the slopes, I think) and predicted probabilities is in practice. Say I am interested in which variables have the greatest impact (in a correlational study) on whether you are going to be successful or unsuccessful in a program (binary response variable). Does this involve effect size or prediction?

The results from both the simulation study and eBay analysis indicate that LPM performs similar to logistic and probit models in terms of coefficient significance, effect size (marginal effect), classification and ranking. LPM coefficients have the added advantage of easier interpretation. But LPM is inferior to logit and probit if predicted probabilities are of interest.

#### GretaGarbo

##### Human
The thing with generalized linear models is that you can choose for yourself what kind of link function you want, just like you can choose to include or not include an x-variable. So the link can be the logit link function, the identity link function (as in the LPM), the probit link, or the complementary log-log link. You can even...
(watch out, Spunky, here is a trigger warning) .... even have the Cauchy distribution (i.e. the cauchit link). It is your model and you can choose one that fits the data.

Also, you can choose the estimator. It could be OLS or it could be ML. Even in the LPM you can use ML (just choose the identity link function) so that the differences in variances are taken into account. For a GLM, the ML estimate is computed by iteratively re-weighted least squares.

But it doesn't matter that much if you choose OLS. When p=0.2 the variance would be proportional to 0.2*(1-0.2)=0.16, because it comes from the binomial distribution. And when p=0.5 the variance is 0.5*(1-0.5)=0.25 and that is not so different from 0.16.
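The arithmetic above is easy to tabulate. This small sketch prints the Bernoulli variance p(1-p) at a few probabilities; the values 0.05 and 0.95 are added only as a contrast and are not from the post.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 outcome with success probability p."""
    return p * (1.0 - p)


# Probabilities inside the .2 to .8 range, plus two tail values for contrast.
for p in (0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95):
    print(f"p = {p:.2f}  variance = {bernoulli_variance(p):.4f}")

# How much larger the variance is at p = 0.5 than at p = 0.2.
ratio_inside = bernoulli_variance(0.5) / bernoulli_variance(0.2)
# The same comparison against a tail value.
ratio_tail = bernoulli_variance(0.5) / bernoulli_variance(0.05)

print(f"variance ratio 0.5 vs 0.2:  {ratio_inside:.2f}")
print(f"variance ratio 0.5 vs 0.05: {ratio_tail:.2f}")
```

Inside the .2 to .8 range the variance changes by a factor of about 1.6 at most, so the heteroskedasticity that OLS ignores is mild there; it becomes much more pronounced toward the tails.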

Of course there can be a continuous x-variable. But it must stay within an interval (x_min ; x_max) that gives predicted values of p between 0.2 and 0.8. I guess that the author had in mind an x-variable along the whole real line from -infinity to +infinity. Then p would fall outside the interval 0.2 to 0.8.
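As a rough illustration of that interval, suppose the true relationship is logistic with a known intercept and slope. The coefficients below are hypothetical, chosen only to have something to compute. The x-range that keeps p between 0.2 and 0.8 then follows from inverting the logit:

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability corresponding to log odds z."""
    return 1.0 / (1.0 + math.exp(-z))


def x_range_for_probabilities(intercept, slope, p_low=0.2, p_high=0.8):
    """Interval of x for which inv_logit(intercept + slope * x) lies in [p_low, p_high]."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    x1 = (logit(p_low) - intercept) / slope
    x2 = (logit(p_high) - intercept) / slope
    return min(x1, x2), max(x1, x2)


# Hypothetical logistic coefficients (assumptions for illustration only).
intercept, slope = -1.0, 0.5

x_min, x_max = x_range_for_probabilities(intercept, slope)
print(f"x_min = {x_min:.3f}, x_max = {x_max:.3f}")
print(f"p at x_min = {inv_logit(intercept + slope * x_min):.2f}")
print(f"p at x_max = {inv_logit(intercept + slope * x_max):.2f}")
```

In practice the coefficients are unknown, so one would look at the fitted probabilities from the data at hand rather than rely on assumed values like these.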

But I find this amusing:
....has been told by the federal government to use a linear probability model for this
The government is choosing the model for you! Not what fits the data.

#### noetsi

##### Fortran must die
The federal government has decided both what variables to use and the methodology. When you get 79 percent of your money from the federal government that is how it works.

#### hlsmith

##### Not a robit
I didn't exactly follow your quote, but is their lack of concern with a binary variable because the slope between the two values is just a straight line? It is the same lack of concern as with linearity in the logit for a binary predictor versus a continuous predictor.

#### noetsi

##### Fortran must die
I don't understand your last comment, hlsmith.

#### hlsmith

##### Not a robit
Referring to the logistic regression assumption of linearity in the logit, which isn't a concern when the predictor is categorical.