# proc genmod binary DV linear probability model

#### noetsi

##### No cake for spunky
Three questions.

First, for the predictors that have two levels, PROC GENMOD shows the results associated with the zero level of the predictor. I want to know the probability associated with being in the 1 level (for example, being Hispanic when you are measuring whether one is Hispanic or not). To do this I can just reverse the sign of the slope, right? So if it says -20 for the 0 level of the predictor (not the DV), I can report 20.

I cannot find in the log, or anywhere else, whether it's predicting the 0 or the 1 level of the DV.

Someone said the linear probability model should be done this way in proc genmod.

this model would be kicking out the probability values
proc genmod descending;
freq count;
model attend = score / link=identity dist=bin;
run;

I have no idea what dist=bin means in this case or whether I should use it, or for that matter what freq count is doing.
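On the first question, a short worked equation may help. It is only a sketch, assuming a single 0/1 predictor $x$ and that the output reports a coefficient for the $x=0$ level with $x=1$ as the reference level. The symbols $\alpha$ and $\gamma$ are introduced here for illustration and do not come from the output.

```latex
\begin{aligned}
P(Y=1 \mid x) &= \alpha + \gamma \,\mathbf{1}[x=0] \\
P(Y=1 \mid x=1) &= \alpha \\
P(Y=1 \mid x=0) &= \alpha + \gamma \\
P(Y=1 \mid x=1) - P(Y=1 \mid x=0) &= -\gamma
\end{aligned}
```

Under those assumptions, a reported $\gamma = -0.20$ means that being in level 1 rather than level 0 raises the probability by 0.20 (20 percentage points). The flipped number is a difference between the two levels, not the probability for level 1 itself.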

#### noetsi

##### No cake for spunky
I have been looking for an answer to this in many places but have not found one yet. I am running a linear probability model. My dependent variable has two levels, 0 and 1. All the predictors I am interested in have two levels as well (dummies coded 0 and 1). It is not clear to me whether the results from PROC GENMOD show the increased probability of being at level 1 of the dependent variable or at level 0. I ran this code (I don’t show the CLASS or MODEL statements because they are very long; I have 50-plus variables in the model).

PROC GENMOD DATA=WORK.SORTTempTableSorted
PLOTS(ONLY)=None;

I am using the defaults for everything. I assume that Genmod with the defaults predicts level 1 (shows the increased or decreased chances of being in level 1 on the DV), but I am not certain. I also assume that it leaves the coding of the dummy predictors the same, so a 0 remains a 0 and a 1 a one.

So in the above results, if you are at level 0 of the predictor, the chance of being in level 1 of the DV is 28.39 percentage points lower.

#### noetsi

##### No cake for spunky
Strange. I added this, which one author suggested:

DIST=BINOMIAL

and got

WARNING: The specified model did not converge.

NOTE: The Pearson chi-square and deviance are not computed since the AGGREGATE option is not specified.

ERROR: The mean parameter is either invalid or at a limit of its range for some observations.

Which does not happen with the default dist=normal.

It is interesting that when you specify this, it tells you that it is modeling the chance of the DV being 0. So I assume that when you use the default of normality it is just treating 0 and 1 as interval values and not modeling moving from one state to another.

Not sure what to do, because the model won't run at all when I specify binomial.

#### noetsi

##### No cake for spunky
I found this note in the SAS documentation, which is distressing, because my data obviously can't be fit with a linear probability model, and logistic regression is not an option either given that the federal government wants an LPM.

Another approach fits a linear probability model with PROC GENMOD (using maximum likelihood estimation) or PROC CATMOD (using weighted least squares estimation). Note that some data might not be well fit by a linear probability model. Various error conditions, such as invalid mean parameter, can occur during model fitting. In PROC GENMOD, use the DIST=BINOMIAL and LINK=IDENTITY options to model the binomial probabilities directly, rather than the logits.

So if you estimate a linear model with PROC GENMOD (with the distribution normal, not binomial) and you don't care about the SEs (I have the population), will the slopes be biased? All I really want to do is tell my audience that, controlling for this set of variables, this level of the predictor has an impact on employment. But I need to control for a set of variables; I can't just run descriptives.

And how do you interpret the slopes when the distribution is normal rather than binomial, with a two-level DV?

The regression actually generates results, but I don't know if they are valid or not. I think this error means some values are outside the accepted range which makes sense. Just not sure if I can use the slopes this way.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Descending, I think, tells it to model the 1s rather than the 0s, but I need to check. The other parts seem fine. Freq count is likely a weighting statement (each row counts as that many observations); if you don't have weights, remove it. Dist=bin means a binomial distribution, which is what you have with a binary DV.
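To make the pieces of that statement concrete, here is a minimal sketch on made-up data. The dataset `work.toy` and its numbers are hypothetical, not from this thread. Each row is a cell, and `count` says how many people fall in it, which is the situation the FREQ statement is meant for.

```sas
/* Hypothetical toy data (not from the thread): one row per cell,
   with count = number of people in that cell. */
data work.toy;
   input attend score count;
   datalines;
1 0 20
0 0 80
1 1 45
0 1 55
;
run;

/* DESCENDING asks GENMOD to model Pr(attend = 1) rather than Pr(attend = 0).
   FREQ treats each row as if it appeared count times.
   DIST=BIN with LINK=IDENTITY fits the probability itself, not the logit. */
proc genmod data=work.toy descending;
   freq count;
   model attend = score / dist=bin link=identity;
run;
```

With these numbers, 20 of 100 people attend when score is 0 and 45 of 100 attend when score is 1. The intercept should therefore come out close to 0.20 and the slope for score close to 0.25, a 25 percentage point difference.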

#### hlsmith

##### Less is more. Stay pure. Stay poor.
It won't converge, and you don't have any continuous variables in the model; those are usually what trigger this, with the predicted outcome falling outside the (0, 1) bounds. If it is kicking out some estimates but no SEs, you can bootstrap to get CIs.
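A sketch of why the bounds matter, in standard notation for a binomial model with an identity link (the symbols are introduced here for illustration):

```latex
\begin{aligned}
\mu_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \\
\ell(\beta) &= \sum_i \bigl[\, y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \,\bigr],
\qquad 0 < \mu_i < 1
\end{aligned}
```

The logarithms are only defined when every fitted $\mu_i$ lies strictly between 0 and 1, and the identity link does nothing to keep it there. That is a plausible reading of the "mean parameter is either invalid or at a limit of its range" message. A fit with the normal distribution minimizes squared error, which is defined for any $\mu_i$, so it would not raise the same error.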

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Ashley Naimi has a paper on this. If you are still getting estimates, just not SEs, I believe you can use the estimates!

#### noetsi

##### No cake for spunky
Do you have that paper? I am getting estimates but not SEs. I don't even care about the SEs because I have the population I care about. I am just not sure whether the estimates are right.

#### noetsi

##### No cake for spunky
I found this comment amusing. Pretty sure he is an economist although he is associated with a university medical program.

Other than interpretation of coefficients or a first pass to modeling, there are NO GOOD REASONS TO USE THE LPM model. Some researchers (ok, economists, mostly) truly love the LPM because the parameters are easy to interpret and often the effects are close enough. Yet, in some cases, the effects could be off, too. But it’s the wrong model. Use a probit or logit, period.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Well it isn't quite the wrong model given the outcomes aren't close to the boundaries.

#### noetsi

##### No cake for spunky
Do you mean the probabilities are not near 0 or 1 on average?

It's the wrong model even then: the SEs are messed up and the relationship is inherently non-linear, but the model won't show you that.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
That paper also tells you how to run it using OLS.
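For what an OLS version might look like in SAS, here is a minimal sketch. It is not taken from the paper. The data are hypothetical, and the use of the HCC option for heteroscedasticity-consistent standard errors is my assumption about a reasonable setup.

```sas
/* Hypothetical toy data (not from the thread): cell counts
   for a 0/1 DV and a 0/1 predictor. */
data work.toy;
   input attend score count;
   datalines;
1 0 20
0 0 80
1 1 45
0 1 55
;
run;

/* Expand to one row per person so no FREQ statement is needed. */
data work.toy_long;
   set work.toy;
   do i = 1 to count;
      output;
   end;
   drop i count;
run;

/* OLS fit of the 0/1 DV; HCC requests
   heteroscedasticity-consistent standard errors. */
proc reg data=work.toy_long;
   model attend = score / hcc;
run;
quit;

/* PROC GENMOD with its defaults (normal distribution, identity link)
   gives the same coefficient estimates as the OLS fit. */
proc genmod data=work.toy_long;
   model attend = score;
run;
```

Both fits should give an intercept of 0.20 and a slope of 0.25 on this toy data, the same as the two group proportions. Because the DV enters as a number here, the coefficients describe the chance of the DV being 1, with no DESCENDING needed.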

#### noetsi

##### No cake for spunky
I missed where it does that. I will have to go back and read it again.

This is an interesting article, hlsmith. Some big names here (one of whom is a big LPM fan).
Microsoft PowerPoint - Better Predicted Probabilities.pptx (stata.com)

In SAS, is this the correct way to specify an LPM?

PROC GENMOD DATA=WORK.SORTTempTableSorted
PLOTS(ONLY)=None descend;
CLASS...
Model....