Logistic regression with dependent variables


New Member
Dear All

I'm trying to perform logistic regression to determine risk factors for retinopathy of prematurity (ROP). The outcome is binary: 1 = required treatment, 0 = did not require treatment.

Two important continuous input variables are gestational age and birthweight.

The more premature you are (lower gestational age) the more likely you are to get ROP.

The less your birthweight, the more likely to get ROP.

But the more premature you are, the more likely you are to have a low birthweight.

Does the logistic regression in SPSS correct for this or are my correlation coefficients going to be spurious?

I'd also like to add in variables such as sex and ethnicity but birthweight would also be dependent on these.

The only way to correct birthweight for age and sex is to calculate the child's centile. Thus a child on the 10th centile is at the top of the lowest 10% for their age, a child on the 50th centile is average, and a child on the 90th centile is at the bottom of the top 10% for their age. The problem with this is that centile and gestational age do not seem to give as strong or good a prediction as simply using birthweight and gestational age (the standard way of doing things).

In the end, what I want to do is work out the probability of requiring treatment given different combinations of the predictor variables.

Any help would be greatly appreciated!

Best Wishes

I am not going to be able to answer a question about a specific output of a specific routine in SPSS, but I can tell you this about multiple regression in general: yes, it correctly accounts for multicollinearity effects.

A two-factor logistic regression assumes

\({\rm logit}(P) = a + b_1 x_1 + b_2 x_2\)

In your example, x_1 might be gestational age and x_2 birthweight. The output of the logistic regression should be best-fit values for a, b_1, and b_2, along with a covariance matrix between them (and from that covariance matrix you can derive error bars on the three parameters).
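A minimal sketch of such a fit, in Python with NumPy rather than SPSS, is below. Everything in it is an assumption made for illustration: the data are simulated (standardized "gestational age" x1 and "birthweight" x2 with a correlation of 0.8), the true coefficients are invented, and `fit_logistic` is a hypothetical helper that does plain Newton-Raphson maximum likelihood. It shows the three outputs described above (best-fit a, b_1, b_2, their covariance matrix, and error bars) plus the predicted probability for one combination of predictors.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Hypothetical helper: fit logit(P) = a + b_1 x_1 + ... by Newton-Raphson.

    X is an (n, k) array of predictors (no intercept column), y an (n,) array of 0/1.
    Returns (beta, cov): beta[0] is the intercept a, cov is the estimated
    covariance matrix of the coefficients (inverse Fisher information).
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        info = A.T @ (A * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, A.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    info = A.T @ (A * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)

# Simulated, standardized predictors (assumed values, not real ROP data).
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)                    # stand-in for gestational age
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # stand-in for birthweight, correlated with x1
eta = -1.0 - 1.0 * x1 - 0.5 * x2           # invented "true" model
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

beta, cov = fit_logistic(np.column_stack([x1, x2]), y)
a, b1, b2 = beta
se = np.sqrt(np.diag(cov))                 # error bars on a, b_1, b_2

# Predicted probability for a baby one standard deviation below average on both.
p_new = 1.0 / (1.0 + np.exp(-(a + b1 * (-1.0) + b2 * (-1.0))))

print("coefficients (a, b1, b2):", beta)
print("standard errors:", se)
print("P(treatment | x1=-1, x2=-1):", p_new)
```

The last step is the inverse of the logit: once a, b_1 and b_2 are estimated, the probability for any combination of predictors is 1 / (1 + exp(-(a + b_1 x_1 + b_2 x_2))).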

Suppose gestational age and birthweight are highly correlated, and both are inversely correlated with the incidence of this pathology. Then, if you did a single-factor logistic regression on either x_1 or x_2, you would expect the b-coefficient to be significantly negative in either case.

Now suppose, furthermore, that, if you control for gestational age, then low birthweight actually protects against the pathology. That is, for a given gestational age, you are actually less likely to get the pathology if your birthweight is low. We just didn't see that effect in our single-factor regression on x_2, because so many of the low-birthweight babies also had low gestational age, and the pure gestational age effect swamped the pure birthweight effect. Then the two-factor logistic regression will correctly see this effect: the b_1-coefficient will be negative, but the b_2-coefficient will be positive.
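The sign flip described above can be reproduced on simulated data. The sketch below is only an illustration under invented assumptions: standardized predictors with a correlation of 0.9, a "true" model in which x1 has coefficient -2.0 and x2 has coefficient +0.5, and a hypothetical Newton-Raphson helper `fit_logistic`. None of these numbers come from real ROP data.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Hypothetical helper: Newton-Raphson fit of logit(P) = a + sum_j b_j x_j."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        info = A.T @ (A * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, A.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(1)
n = 20000
x1 = rng.normal(size=n)                               # stand-in for gestational age
x2 = 0.9 * x1 + np.sqrt(0.19) * rng.normal(size=n)    # stand-in for birthweight

# Invented truth: at fixed x1, LOW x2 is protective (b_2 > 0).
eta = -2.0 * x1 + 0.5 * x2
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

# Single-factor regression on x2 alone: x2 also stands in for the omitted x1.
b2_alone = fit_logistic(x2[:, None], y)[1]

# Two-factor regression: each coefficient is estimated with the other held fixed.
_, b1_joint, b2_joint = fit_logistic(np.column_stack([x1, x2]), y)

print("b_2 from single-factor fit:", b2_alone)
print("b_1, b_2 from two-factor fit:", b1_joint, b2_joint)
```

In this setup the single-factor coefficient on x2 comes out negative, while the two-factor fit recovers a negative b_1 and a positive b_2, which is the pattern the paragraph above describes.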

The one thing that confuses me about your question is that you pose it in terms of "correlation". The regression coefficient is related to correlation in the single-factor case, but not in the multi-factor case. The correlation between birthweight and this pathology is what it is, and looking at any number of additional factors will never change that number or "correct" it for multicollinearity effects. But the coefficients in a multiple regression analysis will change as you add information on additional factors, in a way that correctly accounts for multicollinearity effects that were hidden when the additional factors were not observed.


New Member
Thank you Ichbin.

I was hoping that multiple regression would automatically take care of collinearity between variables. I was just confused when I read online that an assumption of multiple regression was that the variables were independent.

You have helped me greatly. As far as I understand it, I can build my model with as many variables as I like, and the model produced will correctly factor each variable into the equation (although some variables may contribute very little if they have no effect).

That is very helpful!

Thank you!



New Member
I have digested Ichbin's helpful information.

So here is my final question.

It is known that low birthweight, female sex, and young gestational age are all correlated with the risk of needing treatment for ROP.

Therefore, if I run the logistic regression with these 3 variables, I get the following B values:

Variable          B        Sig.
Birthweight       -4.2     0.002
Sex               -0.371   0.373   (note: 1 = male, 0 = female)
Gestationalage    -0.295   0.037

Now, ignoring for now that sex is not statistically significant (I don't have enough cases at present) it would seem that the above conclusions are supported by my data.

What I want to do now is ask whether percentile is in itself an additional risk factor. In other words, if you are small for your age compared with your peers, are you at even greater risk?

Now, what I wonder is this: has the logistic regression I have already done already answered this 'new' question? In other words, if birthweight is a risk factor controlling for age and sex, then it follows that being small for your age is a risk factor.

Percentile is calculated by plotting the child's birthweight on a chart that has the gestational age and sex of the baby to give you the child's percentile based on many hundreds of observations of children of the same sex and age. Therefore, given that it is plotted from the other 3 variables, is it then nonsensical to add percentile to the other 3 variables in my regression?

If I do add it in I get the following coefficients:

Variable          B        Sig.
Birthweight       -3.6     0.169
Sex               -0.405   0.354
Gestationalage    -0.350   0.170
Percentile        -0.517   0.792

The first thing is that none of my coefficients are statistically significant. But if we ignore this (hoping that more data would make them significant) the logistic regression seems to be saying that if you hold birthweight, sex and gestational age constant, then a lower percentile is an independent risk factor.

The problem is this: if you hold birthweight, sex and gestational age constant - you can only have one percentile value!!

So have I just created a nonsense???

Any help would be appreciated. I hope I don't sound nuts here.

Best Wishes

Hi Simon. Yes, you have just created nonsense. And for precisely the reason that you stated: there is no way to "vary the percentile" while holding birthweight, sex, and gestational age constant, but that is what you are asking your multi-variate regression to do when you ask it to give you separate regression coefficients for each of these variables.

You are also seeing here how a multi-variate regression responds to a strong multicollinearity effect. In order to control for the other variables, it is trying to find some points where all those other variables are similar, but the percentile variable is different. But it can't find any such data points. Since it can always compensate for a change in one of the variables by a corresponding change in the other (which it doesn't know isn't actually independently tunable), it ends up assigning very wide error bars to the coefficients of all the involved variables. If you look at the off-diagonal covariance matrix entries for these coefficients, you will see they are very strongly correlated, meaning that if you pinned down the coefficient for one of your variables, that would have a strong effect on the coefficients for the others.
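Both symptoms (wide error bars and strongly correlated coefficient estimates) can be seen in a small simulation. The sketch below rests on invented assumptions: standardized stand-ins for gestational age and birthweight, a "percentile" computed as the empirical rank of birthweight relative to what gestational age predicts (a real percentile would come from a reference chart and would also use sex), a true model with no percentile effect at all, and a hypothetical Newton-Raphson helper `fit_logistic`.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Hypothetical helper: Newton-Raphson fit; returns (beta, covariance matrix)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        info = A.T @ (A * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, A.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    info = A.T @ (A * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)

rng = np.random.default_rng(2)
n = 5000
ga = rng.normal(size=n)                        # stand-in for gestational age
bw = 0.8 * ga + 0.6 * rng.normal(size=n)       # stand-in for birthweight

# Assumed "percentile": rank of size-for-age within the sample, scaled to 0..1.
size_for_age = (bw - 0.8 * ga) / 0.6
pct = np.argsort(np.argsort(size_for_age)) / (n - 1)

# Invented truth: risk depends on ga and bw only.
eta = -1.0 - 1.0 * ga - 0.5 * bw
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

_, cov_without = fit_logistic(np.column_stack([ga, bw]), y)
_, cov_with = fit_logistic(np.column_stack([ga, bw, pct]), y)

se_without = np.sqrt(np.diag(cov_without))     # order: intercept, ga, bw
se_with = np.sqrt(np.diag(cov_with))           # order: intercept, ga, bw, pct

# How much the error bar on the birthweight coefficient widens.
se_inflation_bw = se_with[2] / se_without[2]

# Correlation between the birthweight and percentile coefficient estimates.
corr_bw_pct = cov_with[2, 3] / np.sqrt(cov_with[2, 2] * cov_with[3, 3])

print("SE of bw coefficient without / with percentile:", se_without[2], se_with[2])
print("inflation factor:", se_inflation_bw)
print("correlation of bw and percentile estimates:", corr_bw_pct)
```

Because the rank is a nonlinear function of the other predictors, the fit does not fail outright, but the error bar on the birthweight coefficient widens several-fold and the two estimates are almost perfectly anti-correlated, which matches the loss of significance seen when Percentile was added.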