Unbalanced confounding variable

#1
Apologies if this has been asked before, but I could not find anything relevant.

I have created a physiological score (continuous variable, values between 0 and 1) that correlates to a disease (binary variable, 0 = healthy, 1 = patient). The idea is to use this score to predict the disease (since it is measured relatively easily).

The dataset that I have available, however, is unbalanced with regards to age (healthy people are on average younger than patients). We also know that age plays some role in this disease (indirectly).

I am trying to detect if this bias in age renders my analysis problematic. In particular, I need to be sure that the good correlation between the score and the disease is not due to an imbalance in age between the two groups.

For this reason, I fitted two separate GLMs (binomial family, logit link):

Disease ~ age
Disease ~ score + age

According to the results, adding the score variable substantially reduced the residuals (72, down from 110 for the model with age only). Furthermore, both variables have coefficients that are statistically significant.

Is this enough to show that, even correcting for age, the physiological score has some explanatory power for the disease?
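
For reference, if the two figures above are residual deviances from models fitted to the same observations (an assumption; the post only says "residuals"), the comparison amounts to a likelihood-ratio test with one degree of freedom, because the models differ by a single parameter. A minimal sketch in Python, where the function name `lr_test_1df` is made up for illustration:

```python
import math

def lr_test_1df(deviance_reduced, deviance_full):
    """Likelihood-ratio test for nested GLMs that differ by one parameter."""
    stat = deviance_reduced - deviance_full
    # For 1 degree of freedom, the chi-square survival function has the
    # closed form erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Values quoted in the post, ASSUMED here to be residual deviances.
stat, p_value = lr_test_1df(110.0, 72.0)
print(f"LR statistic = {stat:.1f}, p = {p_value:.2e}")
```

A small p-value here would only say that the score improves the fit beyond a linear age term; it does not rule out residual confounding, for example if the effect of age is not linear on the logit scale.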
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Tell us more about your sample. This line opens the door to a lot of questions: "however, is unbalanced with regards to age (healthy people are on average younger than patients)".

"(indirectly)": do you mean that the two are independent, or that you think they are almost independent?

Where does the score come from and how was it developed?

If the score were binary you could add weights to the model, but since it is continuous this gets tricky. Please post two histograms of age, one for the diseased group and one for the undiseased group. Stack or overlay them if possible, so we can see their overlap.
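
If a plot is inconvenient, the same overlap check can be done by counting both groups on shared age bins. A rough sketch in Python; the ages and labels below are made up for illustration and are not the poster's data, and the 10-year bin width is an arbitrary choice:

```python
from collections import Counter

def age_histograms(ages, disease, bin_width=10):
    """Count healthy (0) and diseased (1) subjects per shared age bin."""
    counts = {0: Counter(), 1: Counter()}
    for age, d in zip(ages, disease):
        counts[d][(age // bin_width) * bin_width] += 1
    bins = sorted(set(counts[0]) | set(counts[1]))
    # Each row is (bin start, healthy count, diseased count).
    return [(b, counts[0][b], counts[1][b]) for b in bins]

# Hypothetical sample: younger healthy volunteers, older patients.
ages = [25, 31, 38, 44, 52, 58, 61, 67, 72, 79]
disease = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

rows = age_histograms(ages, disease)
for start, healthy, diseased in rows:
    print(f"{start}-{start + 9}: healthy={healthy} diseased={diseased}")

# Bins containing both groups are the only region where age adjustment
# is supported by data rather than by extrapolation.
overlap_bins = [start for start, h, d in rows if h > 0 and d > 0]
print("bins with both groups:", overlap_bins)
```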
 
#3
Thank you hlsmith for the reply! Here is the relevant info:

- About the sample: We recruited patients from our clinic, and also measured healthy volunteers for comparison. Due to the nature of the recruitment, most patients were old. It was, however, very hard to get healthy volunteers that were completely age-matched with the patients. Thus the imbalance. I understand that this is a study design issue; however, I can only work with what I have.

- By "indirectly" I mean that age has no direct effect on what I am measuring (i.e. the score does not increase/drop with age). But due to a number of secondary factors, older people could have a lower score.

- Unfortunately I cannot say a lot about the score at this point. It is derived from physiological parameters that are medically known to be affected by the disease.

- Here is a histogram of age with regards to disease:

fig.png
 

hlsmith

#4
This is a little sketchy, "- By "indirectly" I mean that age has no direct effect on what I am measuring (i.e. the score does not increase/drop with age). But due to a number of secondary factors, older people could have a lower score."

So this means it does affect the score, just indirectly, which is important to know and pretty much means it affects the score.

These relationships need to be mapped out. If age were completely independent of score and associated with disease status, adding it should not change the score estimate, and it might make the estimate's standard errors smaller. Does the incorporation of age change the score's coefficient?

In either scenario, whether age is independent or on a backdoor path, it should be in the model.
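
The coefficient check can be sketched as follows: fit the model with and without age and compare the score coefficient. This is only an illustration in plain Python. The data are simulated under made-up assumptions (age lowers the score and raises disease risk), the helpers `solve` and `fit_logit` are hypothetical names, and a real analysis would use a statistics package rather than this hand-rolled Newton-Raphson fit.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_logit(rows, y, max_iter=25, tol=1e-8):
    """Logistic regression by Newton-Raphson; rows include the intercept."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi in zip(rows, y):
            eta = sum(b * xi for b, xi in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += x[j] * (yi - mu)
                for k in range(p):
                    hess[j][k] += w * x[j] * x[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Simulated, hypothetical data: NOT the poster's sample.
random.seed(0)
crude_rows, adjusted_rows, disease = [], [], []
for _ in range(1000):
    age_dec = random.uniform(-3.0, 3.0)  # (age - 50) / 10, i.e. ages 20 to 80
    score = min(1.0, max(0.0, 0.5 - 0.04 * age_dec + random.gauss(0.0, 0.15)))
    logit = -0.5 + 0.4 * age_dec - 4.0 * (score - 0.5)
    y = 1 if random.random() < 1.0 / (1.0 + math.exp(-logit)) else 0
    crude_rows.append([1.0, score])
    adjusted_rows.append([1.0, score, age_dec])
    disease.append(y)

crude = fit_logit(crude_rows, disease)        # Disease ~ score
adjusted = fit_logit(adjusted_rows, disease)  # Disease ~ score + age
print(f"score coefficient, crude:    {crude[1]:.2f}")
print(f"score coefficient, adjusted: {adjusted[1]:.2f}")
print(f"age coefficient (per decade): {adjusted[2]:.2f}")
```

A noticeable shift in the score coefficient between the two fits would suggest that age is related to both the score and the disease, which is the situation where leaving it out would be misleading.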

However, you muddle things further by saying that components of the score are effects of disease. So you are trying to reverse-predict disease, which places the score at least partially after disease. If score were confounded by age and caused disease, you would just fit Disease ~ age + score. But the score is partly an effect of disease, and I currently don't know whether it is a composite that also includes components that cause disease, which would create more mayhem.

Tell me whether the relationship looks like this or not:

1642801574873.png
 
#5
I apologize; I am probably having difficulty explaining the matter at hand properly.

In my mind, age is a distal variable, in the sense that it does not directly affect anything. As a parallel, altitude does not directly affect ambient temperature, even though higher altitudes are generally characterized by lower temperatures; a number of other factors (that just so happen to also correlate with altitude) actually affect ambient temperature. However, seeing your graph, I think that (since we are not talking about causality) you can simplify it and drop the "secondary factors".

There is, however, another catch: some of the components that make up the score are affected by age, while others are not.

I tried revising your graph to illustrate these points. I hope this helps!
 
