I would like to investigate whether disease X is independently associated with disease Y after controlling for some other clinical variables. My tutor recommended logistic regression for this.

My thinking so far is that, in order to control for other variables, I have to force them into the logistic regression model. If I use forward or backward selection (or experiment with removing variables based on clinical judgement to find the best model), some variables will be excluded, and I then cannot claim to have controlled for them in the final model. Can I control for a variable even if it is not in the final model? I have searched the internet and a couple of textbooks for two days straight without finding an answer to this question.
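To make the question concrete, here is a minimal sketch of what forcing a covariate into the model means. Everything in it is an assumption for illustration: the counts are invented, Z is a hypothetical binary covariate, and `solve` and `fit_logistic` are hand-rolled helpers (a plain Newton-Raphson fit on grouped counts), not functions from any statistics package. The data are constructed so that X and Y look associated in the crude model, while the association disappears once Z is kept in the model.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, n_iter=25):
    """Fit a logistic regression to grouped data by Newton-Raphson.

    rows: list of (covariates, events, total), one entry per covariate pattern.
    Returns the coefficients with the intercept first.
    """
    p = len(rows[0][0]) + 1
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]  # information matrix X'WX
        for covs, events, total in rows:
            xi = [1.0] + list(covs)
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = total * mu * (1.0 - mu)
            for j in range(p):
                grad[j] += xi[j] * (events - total * mu)
                for k in range(p):
                    info[j][k] += w * xi[j] * xi[k]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Invented counts: (x, z, number with disease Y, number of patients).
cells = [
    (0, 0, 10, 100),
    (1, 0, 2, 20),
    (0, 1, 10, 20),
    (1, 1, 50, 100),
]

# Crude model: Y ~ X, with Z ignored (cells collapsed over Z).
crude_rows = []
for x in (0, 1):
    events = sum(e for cx, _, e, _ in cells if cx == x)
    total = sum(t for cx, _, _, t in cells if cx == x)
    crude_rows.append(((x,), events, total))
crude = fit_logistic(crude_rows)

# Adjusted model: Y ~ X + Z, with Z forced in regardless of its own p-value.
adjusted = fit_logistic([((x, z), e, t) for x, z, e, t in cells])

print("crude OR for X:   ", round(math.exp(crude[1]), 3))
print("adjusted OR for X:", round(math.exp(adjusted[1]), 3))
```

With these invented counts the crude odds ratio for X is about 3.8, while the odds ratio from the model that also contains Z is 1. In this toy setup, the adjusted estimate is by construction the coefficient of X in a model that still includes Z.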

Process for figuring out the variables I want to control for:

Decided which variables could plausibly be associated with X or Y

Tested these associations for significance with crosstabs and t-tests

Picked a rough list of variables with p < 0.15

Excluded some variables because of too many missing values compared to N

Excluded some variables because of suspected data collection errors

Excluded three variables that are part of the definition of Y

- I don't know if this is correct. Is it?

Grouped variables with obvious collinearity, kept the most representative one from each group, and excluded the others to avoid multicollinearity

----> The final list that I want to force into the logistic regression model, together with disease X, age and sex
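The crosstab screening and the collinearity check in the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the variable names and all counts are invented, `chi2_2x2` and `phi` are hypothetical helper names, the chi-square is the plain Pearson statistic without continuity correction, and the t-test part of the screening is not shown.

```python
import math

ALPHA = 0.15  # the liberal screening threshold from the list above


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


def phi(a, b, c, d):
    """Phi coefficient: the correlation between two binary variables."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Invented crosstabs of each candidate against Y, laid out as
# (candidate+ & Y+, candidate+ & Y-, candidate- & Y+, candidate- & Y-).
candidates = {
    "smoking": (30, 70, 15, 85),
    "hypertension": (27, 73, 17, 83),
    "diabetes": (20, 80, 18, 82),
}

p_values = {name: chi2_2x2(*table)[1] for name, table in candidates.items()}
kept = [name for name, p in p_values.items() if p < ALPHA]

for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
print("passed screening:", kept)

# Collinearity check between two hypothetical binary candidates,
# laid out as (both present, first only, second only, neither).
overlap = phi(45, 5, 5, 145)
print("phi between the two candidates:", round(overlap, 3))
```

In this made-up example the second variable has a p-value between 0.05 and 0.15, so it is kept by the 0.15 threshold but would have been dropped at 0.05.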

Does anything seem fishy?