Linear probability model - Question about specification

#1
I saw a presentation at work a couple of days ago on a linear probability model. I jotted down the model specification so that I could later attempt to recreate it, but I'm confused by it. I do not have any experience with this type of model and I'm not a researcher/academic (I do data analytics/data science for a living), so please keep that in mind. The model was specified as:

1629988324042.png

I don't understand how this would work. Specifically:

  1. Would alpha be a vector of coefficients, one for each personal characteristic? If so, why isn't there an i subscript? Surely there isn't just one coefficient for all of the personal characteristics....?
  2. Would beta just be a coefficient for the PT variable?
  3. Why aren't there coefficients for the occupation and region vectors?
  4. It doesn't seem possible that this could add up to only one. If phi and lambda are vectors holding dummy variables (0 or 1), you could already have a total of 2 just with those two variables.
Any help understanding this would be greatly appreciated!!
 

hlsmith

#2
Yeah, I am not familiar with this nomenclature either, but for the most part LPMs are just linear regression (OLS) with a dependent variable that is 0 or 1. A person's training may dictate the nomenclature they use. You also have to accept that the person could have made a typo in the subscripts, since like you I would expect there to be multiple coefficients. Perhaps the person has an economics background, where people love LPMs. The one thing to keep in mind with LPMs is that the prevalence of the outcome should ideally be around 20-80% if you don't have a large sample size or good precision, since otherwise you could have a predicted probability near the bounds (i.e., 0 or 1) and the standard errors push the confidence intervals beyond them.
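To make the "LPM is just OLS on a 0/1 outcome" point concrete, here is a minimal sketch on simulated data. The original equation is only available as an image, so everything here is an assumption: the variable names (`age`, `pt`, `occ`, `reg`), the effect sizes, and the reading of phi and lambda as occupation and region fixed effects (one coefficient per non-reference dummy). Under that reading, the dummies do have coefficients; they are simply written as phi and lambda rather than as a coefficient times a variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data (not the presenter's): one personal characteristic,
# a 0/1 PT indicator, and categorical occupation/region codes.
age = rng.normal(size=n)
pt = rng.integers(0, 2, size=n)
occ = rng.integers(0, 3, size=n)   # occupation codes 0..2
reg = rng.integers(0, 4, size=n)   # region codes 0..3

# Assumed true model: the probability is linear in every term.
occ_eff = np.array([0.00, 0.05, -0.03])
reg_eff = np.array([0.00, 0.02, 0.04, -0.02])
p = np.clip(0.30 + 0.03 * age + 0.10 * pt + occ_eff[occ] + reg_eff[reg], 0, 1)
y = rng.binomial(1, p).astype(float)


def dummies(codes, n_levels):
    """One 0/1 column per level, dropping level 0 as the reference category."""
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)


# Design matrix: intercept, characteristic, PT, occupation and region dummies.
X = np.column_stack([np.ones(n), age, pt, dummies(occ, 3), dummies(reg, 4)])
names = ["intercept", "alpha_age", "beta_PT",
         "phi_occ1", "phi_occ2",
         "lambda_reg1", "lambda_reg2", "lambda_reg3"]

# The LPM fit is ordinary least squares on the 0/1 outcome.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, b in zip(names, coef):
    print(f"{name:>12s} {b: .3f}")

# Per person, at most one occupation dummy and one region dummy equal 1.
max_active = int((dummies(occ, 3).sum(axis=1) + dummies(reg, 4).sum(axis=1)).max())

# The fitted value adds coefficient-weighted terms, not the raw dummy values.
fitted = X @ coef
print("active dummies per person (max):", max_active)
print("fitted range:", fitted.min(), fitted.max())
```

With several personal characteristics, alpha would be a vector with one entry per column. Two dummies can both equal 1 for the same person, but what enters the prediction is their coefficients, which are small shifts in probability, so the total does not have to exceed 1.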
 
#3
The one thing to keep in mind with LPMs is that the prevalence of the outcome should ideally be around 20-80% if you don't have a large sample size or good precision, since otherwise you could have a predicted probability near the bounds (i.e., 0 or 1) and the standard errors push the confidence intervals beyond them.
That probably explains why I'm getting such terrible results in my attempts to model the same data that the presenter used. In this case, the source data indicate whether a respondent to a survey is an independent contractor vs. an employee, and only about 7% of all the respondents identified as independent contractors. When I try to model this using a GLM with a binomially distributed error and an identity link function, the model never predicts that anyone is an independent contractor :D
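One possible reading of "never predicts anyone" can be sketched with simulated data. The assumptions are loud ones: the data below are made up with roughly 7% prevalence and modest effects, plain OLS via NumPy stands in for the binomial GLM with identity link, and class labels are assumed to come from a 0.5 cutoff on the fitted probability. Under those assumptions the fitted probabilities stay far below 0.5, so no one is classified as a contractor even though the model is estimating the probabilities it is meant to estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical predictors; effect sizes are illustrative assumptions.
x = rng.normal(size=n)
pt = rng.integers(0, 2, size=n)

# Rare outcome: true probability averages about 7%.
p = np.clip(0.05 + 0.015 * x + 0.04 * pt, 0.001, 0.999)
y = rng.binomial(1, p).astype(float)

# Linear probability model fitted by OLS (stand-in for the identity-link GLM).
X = np.column_stack([np.ones(n), x, pt])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

# Classifying at 0.5 (an assumed cutoff) labels everyone as an employee.
n_predicted_positive = int((fitted >= 0.5).sum())

print("observed prevalence:   ", y.mean())
print("mean fitted value:     ", fitted.mean())
print("largest fitted value:  ", fitted.max())
print("predicted contractors: ", n_predicted_positive)
print("fitted values below 0: ", int((fitted < 0).sum()))
```

With an intercept in the model, the mean fitted value equals the observed prevalence, so the fit is doing its job as a probability model. If a yes/no label is needed, a cutoff near the prevalence or a ranking by fitted probability may be more informative than 0.5.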