Logistic Regression predicted probability is either 1 or 0 (or literally 2.2204E-16)

Rizer

New Member
#1
I am doing a test logistic regression to predict whether employees will stay in the company for more than 3 years.

After the model is trained, the predictions it produces are only probabilities of "1" and "2.2204E-16" (essentially 0).

I thought the probabilities would normally lie somewhere between 0 and 1. Is this due to a lack of training data? Or a model convergence problem? Are there ways to solve this?
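For context: fitted probabilities of exactly 1 or ~1e-16 usually mean the linear predictor Xβ has become huge in magnitude (for example, from complete separation or severe overfitting), so the logistic function saturates in double precision. A minimal Python sketch of the saturation effect (illustrative only; the original model uses MATLAB's fitglm):

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Moderate linear predictors give probabilities strictly between 0 and 1...
print(sigmoid(0.0))    # 0.5
print(sigmoid(2.0))    # ~0.88
# ...but once |z| is large, the result rounds to exactly 1.0 (or underflows
# toward 0) in IEEE double precision, which is what a saturated fit produces.
print(sigmoid(40.0))   # exactly 1.0 in double precision
print(sigmoid(-40.0))  # ~4.2e-18
```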
 

Dason

Ambassador to the humans
#2

How many predictors are you using? What are your sample sizes?
 

Rizer

New Member
#3

How many predictors are you using? What are your sample sizes?
I have 533 predictors and 18,000 training observations.

The training phase sometimes gives 2 warnings:
"Iteration limit is reached"
"Regression design matrix is rank deficient to within machine precision"

Could these be causing the predicted probabilities of "1" and "0"?
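On the second warning: a rank-deficient design matrix means some columns are exact linear combinations of others, which is common with hundreds of dummy-coded categorical predictors (e.g., redundant dummies or duplicated variables). A quick numpy sketch of how that looks (the specific columns are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # 4 independent predictor columns
dup = X[:, 0] + X[:, 1]           # a 5th column that is a linear combination
X = np.column_stack([X, dup])

print(X.shape[1])                 # 5 columns...
print(np.linalg.matrix_rank(X))   # ...but rank 4: rank deficient
```

When the design matrix is rank deficient, the coefficients are not uniquely determined, which is one reason the fit can behave erratically.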
 

hlsmith

Less is more. Stay pure. Stay poor.
#4

Can you post your full output?
 

Rizer

New Member
#5

Can you post your full output?
Thanks for helping :)

I used the MATLAB function "fitglm" to fit the logistic regression, setting the 'Distribution' parameter equal to 'binomial':

Logi_COE_P = fitglm(training_data_matrix, result_data_matrix, 'linear', ...
    'CategoricalVars', CategorialVariables, 'Distribution', 'binomial', ...
    'Link', 'logit', 'BinomialSize', 1, 'DispersionFlag', true, ...
    'Weights', OverllDataWeight);


During the training process, it gives the warnings:

Warning: Removing terms where categorical variables
appear in powers higher than linear.
> In FormulaProcessor>FormulaProcessor.removeCategoricalPowers at 510
In TermsRegression>TermsRegression.removeCategoricalPowers at 396
In GeneralizedLinearModel>GeneralizedLinearModel.fit at 1244
In fitglm at 133
In Forecast at 248
Warning: Iteration limit reached.
> In glmfit at 368
In GeneralizedLinearModel>GeneralizedLinearModel.fitter at 919
In FitObject>FitObject.doFit at 220
In GeneralizedLinearModel>GeneralizedLinearModel.fit at 1245
In fitglm at 133
In Forecast at 248
Warning: Regression design matrix is rank deficient
to within machine precision.
> In TermsRegression>TermsRegression.checkDesignRank at 98
In GeneralizedLinearModel>GeneralizedLinearModel.fit at 1262
In fitglm at 133
In Forecast at 248


For the predictions given by the trained model, it gives:

Employee number:                           1  2  3  4  5  6  7         8         9         10        11        12        13        14
Probability of staying more than 3 years:  1  1  1  1  1  1  2.22E-16  2.22E-16  2.22E-16  2.22E-16  2.22E-16  2.22E-16  2.22E-16  2.22E-16


Any idea what all these mean?
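One observation on the output above: 2.2204E-16 is not an arbitrary number. It is double-precision machine epsilon (what MATLAB calls eps), the gap between 1.0 and the next representable float. A fitted probability sitting exactly at eps means the prediction has been pinned at the numerical floor, not that the model estimated a tiny-but-meaningful probability:

```python
import sys

# Machine epsilon for IEEE doubles: the smallest x with 1.0 + x > 1.0.
# This is the 2.2204E-16 appearing in the predictions.
print(sys.float_info.epsilon)   # 2.220446049250313e-16

# Anything much smaller than eps vanishes next to 1.0:
print(1.0 + sys.float_info.epsilon > 1.0)       # True
print(1.0 + sys.float_info.epsilon / 4 == 1.0)  # True
```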
 

hlsmith

Less is more. Stay pure. Stay poor.
#6

No idea; I'm not familiar enough with MATLAB or the procedure. I would consult the documentation for the procedure. Is this a cross-validation procedure? Sometimes you can change the number of iterations in programs, but you seem to have other issues as well.
 

Dason

Ambassador to the humans
#7

Why do you have so many predictors?
 

Rizer

New Member
#8

Thanks for the effort :)

Yeah, there are a few issues; I'm not sure which one causes the unwanted results...
 

Rizer

New Member
#9

Why do you have so many predictors?
My thought was that I could start with many possibly meaningful predictors, and the non-meaningful ones would end up with coefficients close to 0, or with high p-values in the fit results.

Would 18,000 training observations normally be enough for 500-ish predictors?
 

hlsmith

Less is more. Stay pure. Stay poor.
#10

Of the 18,000 observations, how many have the outcome of interest? The general rule is that you take the smaller outcome group (e.g., at a 50% split that would be 9,000) and you may be able to support one predictor for every 10-20 observations in that group (so 450 to 900 predictors).

Though, big picture, you seem to be fishing for results instead of making advances based on prior knowledge. You should work on building the model up. Can you get your model to run with a few predictors?
 
Rizer

New Member
#11

The observations with the desired outcome are about 1/10 of the sample size.

By taking the smaller proportion group of the outcome, do you mean I should pick a subset that contains similar numbers of desired and undesired outcomes?

I think you are right. I shouldn't be fishing for results; I should try a few predictors first, then improve from there.

Thanks for your suggestions :)
 

hlsmith

Less is more. Stay pure. Stay poor.
#12

So if you had 18,000 observations, with 1,800 1s and 16,200 0s, then you may be powered for 90 to 180 predictors. That is a rough rule of thumb.
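The arithmetic behind that rule of thumb, as a quick check (the 1,800 minority-class count comes from the thread; 10-20 events per variable is the stated rule):

```python
events = 1800            # minority-class (1s) count from the thread
for epv in (20, 10):     # events-per-variable rule of thumb
    print(f"{epv} events per predictor -> about {events // epv} predictors")
# 20 -> 90, 10 -> 180: far fewer than the 533 predictors in the model
```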