Adjusting for clustered observations


I have collected mortality data and several covariates (data on treatment and vital parameters) from patients in two different hospitals. My goal is to analyse effects of treatments on mortality while adjusting for covariates, and using logistic regression would be my first choice to accomplish this goal.

Since patients within each hospital are not independent of each other, and since there may be a systematic difference in treatment between the two hospitals, I would like to adjust my analysis for the fact that observations are clustered within hospitals.

I have specified different models:
1. Logistic regression including hospital as a fixed-effect covariate (there are only two hospitals, and I was also interested in the effect of hospital on mortality).
2. Random-intercept logistic model using hospital as a random effect and the other covariates as fixed effects.
3. Generalized estimating equations (GEE) with a logit link function, assuming an exchangeable correlation structure and using robust standard errors to adjust for clustered observations.
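For readers who want to see what model 1 amounts to, here is a minimal sketch of a logistic regression with hospital entered as a fixed-effect dummy, fitted by Newton-Raphson using only the Python standard library. The data are simulated, and the effect sizes, sample sizes, and function names (`fit_logistic`, `solve`) are assumptions for illustration, not values from my study. Models 2 and 3 need dedicated mixed-model or GEE software and are not shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(rows, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        # Newton step: beta <- beta + (X'WX)^-1 X'(y - p)
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Simulated patients (hypothetical effects on the log-odds scale).
random.seed(0)
rows, y = [], []
for hospital in (0, 1):                      # dummy: 0 = hospital 1, 1 = hospital 2
    for _ in range(2000):
        treated = random.randint(0, 1)
        eta = -1.5 + 0.8 * hospital - 0.6 * treated
        rows.append([1.0, float(hospital), float(treated)])
        y.append(1 if random.random() < sigmoid(eta) else 0)

beta = fit_logistic(rows, y)
print("intercept, hospital, treatment:", [round(b, 3) for b in beta])
```

The hospital coefficient here is a single estimated shift in log-odds between the two hospitals; the treatment coefficient is then interpreted within hospital.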

Although I would say that all three models should adequately address the issue of clustered observations (and in fact should provide similar results), the results are very different. The mixed model and the GEE (models 2 and 3) produce similar results, but the first model is very different. Under one approach a certain treatment appears to influence mortality, whereas under the other it does not. The same is true for the significance of the covariates.

Which model would you consider most adequate to adjust for the fact that patients were treated in two different hospitals? Does anyone have an idea why treating hospital as a fixed effect makes such a big difference compared with the random-effects approach? Shouldn't the results be very similar when there are only two hospitals?



Less is more. Stay pure. Stay poor.
Well, I will start off by applauding your awareness of a potential issue and your approaches. I am not a master of multilevel models by any stretch - I am only self-taught.

I believe the first step is to run an empty model with no predictors but with the random effects (random intercepts in your case). If that model shows a significant amount of variance, then you control for clustering in the iteratively built models.

The reason your first model is different comes down to the amount of variance you are neglecting to measure - the between-hospital effects. Model two explains the outcome by both within- and between-facility variance, and the last model has robust errors, so it will also make finding an effect more difficult, since you are comparatively widening your confidence intervals.

The literature says that not addressing clustering can lead to Type I errors, that is, saying there is a difference when there isn't. I would examine how much of the covariance you can explain by controlling for clustering. I recall that when I ran my first multilevel model, I thought it would shrink my confidence intervals and make it easier to find significance. You are on the right path; just think about meta-analyses. You can't simply pool the results from two studies together, you have to control for study differences, much like you may have to control for hospital differences.
Hi hlsmith,

thanks for your answer. As you suggested, I ran the empty random-intercepts model, and indeed there was significant variance.

I'm not quite sure why the fixed-effects model neglects between-hospital variance. There is a significant effect of hospital on mortality when treating hospital as a fixed effect, and I would assume that this difference is what you mean by between-hospital variance? If so, shouldn't controlling for this difference by including hospital as a fixed effect in the model also control for between-hospital variance?



Yeah, I get why you make this point. As I mentioned, I am not overly versed in this area, but I am trying to understand it, like yourself. The difference lies in the model: it has an extra term that controls for the random intercepts. It is something like the following:

y = Bo + a + B + e

where Bo represents the random intercepts. If the hospitals have different intercepts, then I believe that means they have different probabilities of y. You are controlling for these different probabilities regardless of the slopes the model derives. I am doing a horrific job describing it, but if you think about lines on a graph, you are saying that because of this variable the clusters have unique intercepts, and therefore different probabilities, that need to be controlled for. Sorry if I am confusing you.
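To make the idea of random intercepts a little more concrete, here is a rough sketch that assumes the usual textbook form of a random-intercept logistic model, logit(P(y = 1)) = b0 + u_j + b*x, with u_j drawn from a normal distribution with mean 0 for each hospital j. The numbers (`b0`, `sigma_u`) are made up for illustration and are not estimates from any data in this thread.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(1)

b0 = -1.5        # overall intercept on the log-odds scale (hypothetical)
sigma_u = 0.7    # SD of the hospital random intercepts (hypothetical)
n_hospitals = 2

# Each hospital gets its own deviation u_j from the overall intercept.
u = [random.gauss(0.0, sigma_u) for _ in range(n_hospitals)]

# Baseline mortality probability per hospital (all covariates at zero).
baseline = [sigmoid(b0 + u_j) for u_j in u]

for j, (u_j, p_j) in enumerate(zip(u, baseline), start=1):
    print(f"hospital {j}: intercept = {b0 + u_j:.3f}, baseline probability = {p_j:.3f}")
```

Each hospital ends up with its own line on the log-odds scale, shifted up or down by u_j, while the slopes are shared.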
Don't worry about confusing me, I greatly appreciate your answers and I hope you can also share your thoughts on the following:

I get your point of the extra term that represents the random intercept. However, if I enter hospital as a fixed effect, both hospitals also have a different intercept. Consider the following regression model:

y = b0 + Xi*b1 + X*b2 + e

where
b0 = the intercept
Xi = 1 for hospital 1, and 2 for hospital 2
b1 = the fixed-effects regression coefficient for hospital
X = some other covariate(s), and b2 = the regression coefficient(s) for those covariate(s)

Now, hospital 1 has the intercept b0+b1, and hospital 2 has the intercept b0+2b1. As in a mixed (random-intercepts) model, each hospital has its own intercept; the only difference is that the intercepts are determined by the fixed regression coefficients in the one model and by a random deviation around the overall intercept in the other. But shouldn't they both similarly control for between-hospital differences?
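As a quick check of the intercept arithmetic, consider a logistic model that contains only hospital. With two hospitals that model is saturated, so the fitted log-odds in each hospital equal the observed log-odds, and b0 and b1 can be solved by hand. The death counts below are hypothetical and only serve to illustrate the coding.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical counts per hospital (not from my data).
deaths = {1: 30, 2: 60}
patients = {1: 200, 2: 200}

logit_h1 = logit(deaths[1] / patients[1])   # observed log-odds, hospital 1
logit_h2 = logit(deaths[2] / patients[2])   # observed log-odds, hospital 2

# Coding Xi = 1, 2: hospital intercepts are b0 + b1 and b0 + 2*b1.
b1_12 = logit_h2 - logit_h1
b0_12 = logit_h1 - b1_12

# Coding Xi = 0, 1: hospital intercepts are b0 and b0 + b1.
b1_01 = logit_h2 - logit_h1
b0_01 = logit_h1

print("1/2 coding: b0 =", round(b0_12, 4), " b1 =", round(b1_12, 4))
print("0/1 coding: b0 =", round(b0_01, 4), " b1 =", round(b1_01, 4))
```

The hospital coefficient b1 is the same under both codings; only b0 shifts, so the two hospital-specific intercepts are identical either way.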