So many variables--how to approach this?

brooklynqueen

New Member
I'm working on a project about depression as a possible side effect of a particular medication.

The people who take this drug also tend to have other lifestyle habits associated with depression, so I'm trying to figure out whether anything is confounding the relationship between the medication and depression.

I have 100 data points (the medication takers) and 60 variables to work with (exercise habits, drinking habits, a bunch of specific foods and the frequency of their consumption, relationship status, age, etc.). Yes, 60! Way too many, and plenty are probably irrelevant.

Right now, I have all the linear correlations calculated between each independent variable and the medication, as well as each independent variable and depression. About 20 are significantly associated with both the medication and with depression.

My question is, what's the best way to go about modeling this? Should I run multiple regressions with each independent variable (one at a time) plus the medication, with "depression" as the dependent variable, and see which correlations remain strong? And then build another model using whichever variables (including the medication) stayed significant in the first round of regressions? Or is there a better approach?

Sorry if this comes across as amateurish--I'm very new to statistics.

Dason

Ambassador to the humans
I have 100 data points (the medication takers) and 60 variables to work with ...
There is a general rule of thumb for the maximum number of variables to include in a model: you typically want at least 10-20 observations for each variable, so with 100 observations you really should only consider models that have 5-10 predictors at most.
My question is, what's the best way to go about modeling this? Should I run multiple regressions with each independent variable (one at a time) plus the medication, with "depression" as the dependent variable, and see which correlations remain strong? And then build another model using whichever variables (including the medication) stayed significant in the first round of regressions? Or is there a better approach?
This isn't really a good method to choose variables. You most likely will have correlated predictors, and you don't really want too many correlated variables as predictors in a model. There are better ways to choose a model using a stepwise procedure. This sort of explains it but doesn't do a very good job. Basically, with forward selection you do what you were going to do: build a single-predictor model for each of the variables and find which one does the best job. Then keep that variable and try every two-variable model where one of the variables is the one you decided to keep from the first round. You keep doing this until you don't get any significant gains from adding more variables to the model.

Are you using software for the analysis? Most software will do some sort of stepwise model selection procedure for you.
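
For anyone who wants to see the forward selection loop described above spelled out, here is a minimal sketch in plain Python. It is not what SYSTAT or any other package does internally. Several things are assumptions made for illustration: the data are synthetic, the stopping rule is "AIC no longer improves" (swapped in for a significance test on the gain), and the `forced` argument is a hypothetical option for keeping a variable of interest in every model.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(row[r] * row[c] for row in X) for c in range(p)]
         + [sum(row[r] * yi for row, yi in zip(X, y))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k = number of coefficients."""
    return n * math.log(rss / n) + 2 * k


def forward_select(data, y, forced=()):
    """Greedy forward selection; variables in `forced` are always kept."""
    n = len(y)
    selected = list(forced)
    best = aic(ols_rss([data[v] for v in selected], y), n, len(selected) + 1)
    while True:
        # Score every model that adds exactly one more variable.
        scored = [
            (aic(ols_rss([data[s] for s in selected + [v]], y), n,
                 len(selected) + 2), v)
            for v in data if v not in selected
        ]
        if not scored:
            break
        score, var = min(scored)
        if score >= best:  # no candidate improves the criterion: stop
            break
        selected.append(var)
        best = score
    return selected


# Synthetic data (made up for illustration): the outcome depends on
# medication and exercise; the three "noise" variables are irrelevant.
random.seed(1)
n = 100
names = ["medication", "exercise", "noise1", "noise2", "noise3"]
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in names}
y = [0.6 * m - 0.8 * e + random.gauss(0, 0.5)
     for m, e in zip(data["medication"], data["exercise"])]

chosen = forward_select(data, y, forced=["medication"])
print(chosen)
```

Note that an irrelevant variable can still slip in by chance, which is one reason automated selection needs to be treated with care when there are many candidates.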

brooklynqueen

New Member
Thanks a ton Dason.

I'm using a program called SYSTAT. I just found the feature where you can use a stepwise model selection--perfect!

I tried running it and it gave me a model that doesn't even include the medication. This strikes me as odd, because when I tried making models with various combinations of two independent variables (the medication + something else), the medication always came out with a much higher coefficient and a much lower p-value than the other variable. Does this mean the stepwise model is no good, or is my understanding of statistics just deficient?

Thanks again.

Lazar

Phineas Packard
Thanks a ton Dason.

I'm using a program called SYSTAT. I just found the feature where you can use a stepwise model selection--perfect!

I tried running it and it gave me a model that doesn't even include the medication. This strikes me as odd, because when I tried making models with various combinations of two independent variables (the medication + something else), the medication always came out with a much higher coefficient and a much lower p-value than the other variable. Does this mean the stepwise model is no good, or is my understanding of statistics just deficient?

Thanks again.
CAUTION. Be very careful with computer-generated stepwise models (particularly in your case, with many IVs; in fact, I would generally suggest you do not use them)! It is far better to specify the model yourself. In reality, I think you need to think far more about what you are really interested in and be more precise about your model. I would suggest you start by reading Cohen et al.'s book, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Start with chapter 5 to see the problems with computer-generated stepwise models (p. 161 onward).

Berley

Member
Wouldn't a factor analysis be appropriate here? You undoubtedly are going to find a lot of overlap between variables.

taylormk

New Member
I too have had this problem of which predictors to include. I have read many times that stepwise procedures are not a good way to make these decisions. I can't explain exactly why, but it is usually suggested that stepwise procedures be used for "exploratory" analysis when you don't have a specific hypothesis (i.e. you have no clue which predictors may be important). You could try adding one predictor at a time, starting with the one you are most interested in or the one that previous research has suggested is most important. Then build a new model with one more predictor added, then a 3rd model with two more predictors added, etc. At each step you can look at the Rsquare change statistic that should be available in your stats program. With the addition of each new predictor, you look to see if the Rsquare change is significant.

Another important point is that you should look at correlations among your predictors and be aware of the strength of these correlations during your modeling. The classic problem with correlated predictors is "suppressor effects". For example, if you modeled schoolchildren's heights and included the length of their right leg as a predictor, you would find highly significant results. But then if you add the length of their left leg to the model (i.e. two predictors: right leg AND left leg), you would suddenly find that both the right and left leg predictors are not significant anymore. This is because right and left legs share the exact same information (i.e. they are 100% correlated). So from this exercise you would walk away thinking that "leg length" is not a good predictor of height, which is NOT true. Really, you just overfitted the model.
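
To put a rough number on the leg example, here is a small sketch with made-up data. It assumes the two leg lengths differ only by a little measurement noise and uses the standard two-predictor result that the sampling variance of each coefficient is inflated by 1 / (1 - r^2), where r is the correlation between the two predictors. The means and spreads are arbitrary illustration values.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic leg lengths (made-up values): both legs track one underlying
# length and differ only by a small random amount.
random.seed(2)
true_leg = [random.gauss(40.0, 4.0) for _ in range(100)]
right_leg = [t + random.gauss(0.0, 0.3) for t in true_leg]
left_leg = [t + random.gauss(0.0, 0.3) for t in true_leg]

r = pearson_r(right_leg, left_leg)
# Variance inflation factor for a model containing both legs.
vif = 1.0 / (1.0 - r ** 2)
print(f"correlation = {r:.3f}, variance inflation factor = {vif:.1f}")
```

A large inflation factor means the standard errors of both leg coefficients blow up when the two are entered together, which is why neither looks significant even though leg length clearly relates to height.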

Anyways, the topic you asked about is one of the hardest things about modeling and I certainly don't have the answers, but I have learned a few important things along the way.

hope this helps
taylormk

Dason

Ambassador to the humans
I can't explain exactly why, but it is usually suggested that stepwise procedures be used for "exploratory" analysis when you don't have a specific hypothesis (i.e. you have no clue which predictors may be important).
True, you should keep variables you feel are important in the model. You should also keep variables of interest in the model.
At each step you can look at the Rsquare change statistic that should be available in your stats program. With the addition of each new predictor, you look to see if the Rsquare change is significant.
There are a lot of things you could look at, and Rsquare is my least recommended. If anything, look at adjusted Rsquare, but even that isn't very good (in my mind) compared to something like AIC or BIC or even Mallows' Cp.
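
For reference, the criteria mentioned above can be written as a few short functions. This is a sketch using the usual textbook forms for a least-squares model with normal errors (AIC and BIC are given up to an additive constant); the function names and the two example models are made up for illustration.

```python
import math


def adjusted_r2(r2, n, p):
    """Adjusted R-squared; p = number of predictors, not counting the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def aic(rss, n, k):
    """AIC up to a constant; k = number of estimated coefficients, intercept included."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """BIC up to a constant; penalizes each coefficient by log(n) instead of 2."""
    return n * math.log(rss / n) + k * math.log(n)


def mallows_cp(rss, n, k, sigma2_full):
    """Mallows' Cp; sigma2_full = residual variance estimate from the largest model."""
    return rss / sigma2_full - n + 2 * k


# Two hypothetical models fit to the same n = 100 observations:
# the larger one lowers the residual sum of squares only slightly.
n = 100
small = {"rss": 52.0, "k": 3}
large = {"rss": 50.0, "k": 9}
for label, m in (("small", small), ("large", large)):
    print(label,
          round(aic(m["rss"], n, m["k"]), 2),
          round(bic(m["rss"], n, m["k"]), 2))
```

Lower is better for AIC and BIC. In this made-up comparison the small model wins on both, because six extra coefficients buy very little reduction in residual error.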

This is because right and left legs share the exact same information (i.e they are 100% correlated).
This is just me being pedantic but they don't have to be 100% correlated. People can have one leg slightly longer than the other.

Anyways, the topic you asked about is one of the hardest things about modeling and I certainly don't have the answers, but I have learned a few important things along the way.
It can be tough and the process really depends on what exactly you're trying to do. Are you trying to make a decision about a certain predictor? Are you just trying to build a model that explains the data you have in hand? Are you attempting to build a model that will give good predictive qualities? The answers to these questions should change how you go about the process and it can be tough to address all of these concerns (especially just through a forum) without more information.

brooklynqueen

New Member
Lazar--thank you, I'll check out that chapter.

Berley--I found the "factor analysis" option on the software I'm using, but I have no idea how to interpret the numbers it gives me. I see "factor pattern," "communality estimates," "specific variances," and "latent roots"... which of these should I be looking at? Thanks!

taylormk--I'm glad I'm not the only one with this problem! I think I will go with your strategy--I'll start with just the medication as the independent/depression as the dependent and add the other (logical) variables one by one. Some of the variables are redundant or obviously not truly causative of depression (e.g., shoe size), so I won't bother even dealing with those, which should save some time.

Quick question (to anyone). If, in all the models I build, medication still keeps a fairly high coefficient with a low p-value, and still adds unique variance no matter what combinations of variables I run, can I be confident that the medication itself is contributing to depression? (Sorry if this is an obvious question; I just want to make sure I'm not missing something.)

And one more slightly longer question for anyone who has time:

Say I have my initial medication/depression model and it produces this:

beta (for medication) = 0.60, p<0.001

And then I add a new variable (anxiety) that is associated with the medication but more strongly associated with depression, and it does this:

beta (for medication) = 0.40, p<0.01
beta (for new variable) = 0.40, p<0.01

It significantly increases r^2 as well. I'm trying to understand how to interpret this. Does it mean anxiety was partially confounding the relationship between the medication and depression?

Thank you for all the help everyone. I'm learning a lot.

Lazar

Phineas Packard
I too have had this problem of which predictors to include. I have read many times that stepwise procedures are not a good way to make these decisions. I can't explain exactly why, but it is usually suggested that stepwise procedures be used for "exploratory" analysis when you don't have a specific hypothesis (i.e. you have no clue which predictors may be important).
Yes this is generally the advice but in this case even that would not be justified given the number of variables in relation to the sample size. In this case stepwise is pretty much inappropriate and I would not trust the results.

You could try adding one predictor at a time, starting with the one you are most interested in or the one that previous research has suggested is most important. Then build a new model with one more predictor added, then a 3rd model with two more predictors added, etc. At each step you can look at the Rsquare change statistic that should be available in your stats program. With the addition of each new predictor, you look to see if the Rsquare change is significant.
Stepwise aims to explain the most variance with the fewest predictors. However, this relies heavily on chance, and particularly when there are many predictors, stepwise procedures can produce results that are unlikely to be reproduced in another sample and, worse still, have little if any relationship to the population of interest.
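
To illustrate how much chance can do with 60 candidate predictors and 100 observations, here is a small simulation sketch. Everything in it is synthetic: the outcome and all 60 predictors are independent random noise, so any predictor that looks "significant" is a false positive. The critical t value is an approximation for 98 degrees of freedom.

```python
import math
import random

N_OBS = 100
N_VARS = 60
T_CRIT = 1.984  # approximate two-sided 5% critical t value for 98 degrees of freedom


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_false_positives():
    """Count noise predictors whose correlation with a noise outcome passes the 5% test."""
    y = [random.gauss(0, 1) for _ in range(N_OBS)]
    hits = 0
    for _ in range(N_VARS):
        x = [random.gauss(0, 1) for _ in range(N_OBS)]
        r = pearson_r(x, y)
        t = r * math.sqrt((N_OBS - 2) / (1.0 - r * r))
        if abs(t) > T_CRIT:
            hits += 1
    return hits


random.seed(3)
counts = [count_false_positives() for _ in range(100)]
mean_hits = sum(counts) / len(counts)
print(f"average number of 'significant' noise predictors: {mean_hits:.2f}")
```

With a 5% test and 60 unrelated predictors, about 3 per dataset are expected to pass by chance alone, and a stepwise routine will happily pick from exactly those.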

Another important point is that you should look at correlations among your predictors and be aware of the strength of these correlations during your modeling.
Excellent point. I hate to say it but there is simply no substitute for knowing your data inside out (part of which involves exploring correlations but much more is needed). More importantly, there is no substitute for a strong theory and an appropriate design.

The classic problem with correlated predictors is "suppressor effects". For example, if you modeled schoolchildren's heights and included the length of their right leg as a predictor, you would find highly significant results. But then if you add the length of their left leg to the model (i.e. two predictors: right leg AND left leg), you would suddenly find that both the right and left leg predictors are not significant anymore. This is because right and left legs share the exact same information (i.e. they are 100% correlated). So from this exercise you would walk away thinking that "leg length" is not a good predictor of height, which is NOT true. Really, you just overfitted the model.
Suppression is actually different (see here for a rough overview). Again Cohen et al gives a good overview of suppression and how to identify it.

Mean Joe

TS Contributor
Say I have my initial medication/depression model and it produces this:

beta (for medication) = 0.60, p<0.001

And then I add a new variable (anxiety) that is associated with the medication but more strongly associated with depression, and it does this:

beta (for medication) = 0.40, p<0.01
beta (for new variable) = 0.40, p<0.01

It significantly increases r^2 as well. I'm trying to understand how to interpret this. Does it mean anxiety was partially confounding the relationship between the medication and depression?
In your initial model, the result says that people who take medication will be "0.60 more depressed" than people who do not, on average. The problem with this model is that other things affect depression: you cannot simply look at your population as people who take medication vs. those who do not.

In the 2nd model (let's say that the new variable is age, i.e. an old/young dichotomy), the result says that people who take medication are "0.40 more depressed" on average than people who do not. Moreover, people who are old and take medication are "0.80 more depressed" than people who are young and not taking medication. So this model gives more information about the relationship between medication and depression (the initial model just gave one number to apply to everyone; the 2nd model gives one number to apply to one value of the new variable, and another number to apply to the other value).
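
The arithmetic behind those numbers can be laid out as a small sketch. It assumes both predictors are coded 0/1, the coefficients are unstandardized, and the model has no interaction term; the 0.40 values are the ones quoted from the 2nd model, and the function name is made up for illustration.

```python
B_MEDICATION = 0.40  # coefficient for medication in the 2nd model
B_OLD = 0.40         # coefficient for the new dichotomous variable (old = 1, young = 0)


def expected_difference(medication, old):
    """Expected depression score relative to a young person not on the medication."""
    return B_MEDICATION * medication + B_OLD * old


# Print the expected difference for all four combinations of the two indicators.
for medication in (0, 1):
    for old in (0, 1):
        print(f"medication={medication} old={old} "
              f"difference={expected_difference(medication, old):.2f}")
```

The 0.80 figure is simply the two coefficients added together, which only holds if the effects are additive as assumed here.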

You mention a significant increase in r^2; I think in your situation you can ignore r^2. You should not try to totally explain all the variance in depression, based on your "small" sample. It would be unwise to say that you found 20 new factors for depression, in your n=100 sample of people taking medication.

This brings me to the topic of which variables to include in your model. You have interesting findings; but I think since you have a small sample, you should be sure to present one model that sticks to including variables that other papers have mentioned are factors to consider for depression. The community would be interested to know if weight/gender/recent medical history/family history are effects to consider for depression with this medication.

If you have a finding that size 18's (talking about shoes) are more likely to be depressed on this medication, I would say that is a factor for further research (with a larger sample size), but you should stay away from including in the main model. Basically, I'm thinking of staying away from variables that only a handful of people in your sample possess.